-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtipuesearch_content.js
1 lines (1 loc) · 304 KB
/
tipuesearch_content.js
1
var tipuesearch = {"pages":[{"title":"My Journey: Flat Water Foiling Paddle Up","text":"I have been learning to flat water foil paddle-up the past six months. The internet makes it seem as if this is the first step to downwind foiling. My initial guess was that I only needed 3-5 sessions to be able to do it. After 30+ sessions I finally paddle-up the first time. 😲😲😲. It was not a moment of celebration. It was a huge relief! My effort was not in vain. My physical ability is not so far below average compared to any of these foiling youtubers. And at last, I do not have to tell my wife every time I come back from paddling: should I give up? There are some key lessons that I learned along the way that I wish someone could have told me. These are not prevailing theories from the various youtube videos I have watched. I have watched probably 100+ paddle up related videos. For the many months that I toiled in the uncertainty if I could ever get to flight, I had many doubts. I questioned my overall physical strength and flexibility. I asked all the questions one could ask about equipment. Is my board skinny and long enough? Does my front wing have a sufficient low end? Is my paddle blade big enough? Is my mast too long? Is my foil setup too far forward or too far back? Other than changing my downwind board, I tried almost all combinations of equipment variables that were within my control. I had doubts about my techniques. Do I hold the paddles too low or too high? Am I catching forward far enough? Is my stance too wide or too short? Should my back foot be placed directly on the mast, further back, or more forward? Do I shift my feet in the stages? Should I slap the board down with my front feet or emulate the foil pumping motion more? Do I do small pumps, or do I attempt to go high as soon as I feel the board starting to release? Should I paddle with or cross or against the current? Should I get a bit of downwind assist? The internet provided some answers to my questions. But really, I had to answer my own questions or be at peace with not having answers. I was not a total beginner to foiling or supping. I wing foil. I am not good at it, but I could ride the foil comfortably and complete 50% of my jibes. I sup-surfed on and off as well. I sup surfed 3-5 ft SF OB . I thought that my background was sufficient to allow me to learn the flat water paddle up, and it would become a fun thing to do when I don't have time to go surfing and there is no wind to wing 1 . I was not expecting it to become a journey. 2 . For much of my journey and during the session of first successful paddle-up, I was using this Naish kit: 2024 105 liter downwind board, S27 Carbon 35 mast 75cm, S27 HA stabilizer 280, S27 HA front wing 1800, and a paddle with 95 cm surface area blade. The motion of paddling up looks easy: paddle straight, paddle-and-pump, get on foil, and pump away. The whole process takes less than 5 seconds. This 5 second sequence of movement is one of the most challenging physical techniques I have ever attempted to learn. If I were to learn this again with the hindsight knowledge I have now, I would break down the sequence of movement into key stages. Tree Topology of the Technical Stages The technical stages form a topological tree. It is not possible to learn some technical stages unless you already have the prerequisite techniques. For example, I did not realize that I was not a proficient pump foiler when I was giving my 110% to attempt a successful full sequence. I foolishly believed that my background, with experiences in supping and foiling, should allow me to get at least one successful paddle-up after 3-5 sessions. For months, I was sprint paddling and kind of emulating the motion of slapping down the board. I was getting better at paddling and getting stronger. My planing speed was getting faster and faster 3 , but I was not any closer to a successful paddle-up. I had to use other foiling disciplines to help me learn the technical stages in isolation. A successful paddle-up requires the ability to pump with at least some proficiency. I was thinking that I would use a flat water discipline to help me learn pump foiling. That was a problem of chicken and egg. My wife had many laughs over my naivety. At some point in my journey, I realized that I had to go on some side quests. For example, I had to learn pumping first 4 . I chose dock-start as my side quest to accomplish that. I believe I could have done that with SUP foil surfing or e-foiling. The on-flight paddle-pump is an interesting stage that is not often talked about. This technique is critical to the transition from planing to flight. The paddling with the board being grabbed by water tension is a lot more stable than paddling with the board being out of the water. At one point, I was at a stage where I could generate good speed from paddle-and-pump while most of the board was on the water. Every time I started the on-foil transition, it was hard to maintain my acceleration. I realized that I had to isolate that stage. My solution is dock-starting while holding a paddle. My practice hack is to jump into foil pumping, slow down, touch down on the water, and paddle back up to foil pumping. This was relatively easy for me to learn in isolation. It took me 2-3 sessions to check off this technical stage. The table below shows what I consider to be entrypoints for skill acquisition. It should be noted that even though a foiling discipline might require a skill, that discipline is not necessarily a good way to acquire the said skill. For example, learning wing foiling or dock starting is problematic if someone has never ridden a foil. That is best done towing behind a boat or through e-foiling. Wing foil uses a lot of pumping, but it is much easier to learn that skill through dock start or foil surfing. Foil Riding Pump Foiling Paddle-Pump On-Flight Paddle-Pump Sprint Paddle Tow Foiling ✓ maybe Wing Foiling maybe maybe SUP Foil Surfing maybe ✓ ✓ ✓ ✓ E Foil ✓ ✓ ✓ ✓ Dock Start ✓ ✓ ✓ Side Quests and Skill Acquisition FAQ Q: How should I hold the paddle? Ans : I learned that there were two distinct phases of paddling: sprinting and paddle-pumping. During the sprint phase, my lower hand is closer to the paddle blade. It allows me to go for a faster cadence and draw on more powerful strokes. The key draw is that I squat lower. As I transition to pumping and engaging the foil, I need a more standup posture. My lower hand slides about 2 inches higher. This is a critical transition. I had noticed that without this transition, I could never pump the foil properly to accelerate. Q: How could I use the current or wind to assist a paddle-up? Ans : There are theories out there that going cross current or against the current could help to get on foil because the foil gets lift from the amount of water flowing through its front wing. I disagree with that approach. The most critical element is board speed relative to the water surface. The most energy-consuming part is getting the board to sufficient speed so that I can engage the pumping action to break the water tension. Once the pumping starts, the drag is reduced, and the foil pumping becomes the main power source. This is made obvious from a dock start. If I were to jump onto a board but do not immediately achieve flight, I only need a few paddles to get up on foil because the jumping gives me the initial speed necessary to break water tension. Whenever there is a bit of wind, having the wind on my back helps a lot with the paddle-up. I also find it easier to go with the current. It is much less effort to gain the initial speed into the pumping action to break water tension. Q: How important is using a dedicated flat water foil setup like the Axis PNG 1300? Ans : Skills are more important than dedicated high-end gear. That was one of my most painful decision points. When I was not able to paddle up, I had serious doubts about whether it was my skills or the gear's problem. I believe it is 80% skills. Gears could help on the margin, but gears still would not allow me to skip a technical skill stage. Q: What were some of the critical mistakes I made? Ans : Not properly foil pumping while planing. Pumping is more important than paddling because pumping generates more sustainable forward momentum. Foil pumping is not just slapping the board down to the water. That is not a good mental model. The slapping might feel like pumping but does not contribute to acceleration. The pumping has to engage the foil. It is the same technique as an on-flight foil pump. Pumping has to be performed well before flight. Pumping is not just to sustain flight; it gets me to flight. My back foot was too far back. I have a habit that I am still not sure if it is a necessity or a mistake. I do 2-4 strokes to get the board going, but then I tend to slide my back foot toward the tail for about half of a foot to get more lift to break water tension. It is possible that it is better to always keep the back foot in one place, but I still don't fully understand it. What I am certain of is that my back foot has to shift forward to be right on top of the mast or ahead of the mast to be able to pump the foil. If my back foot is stuck behind the mast, I never take flight. Holding the paddle too low. Similar to sliding my back foot, I shift my lower hand toward the top as I transition to pumping. I start holding the paddle lower to do my power strokes to get to maximum planing speed. That lasts about 4-6 strokes. As soon as I break the water tension with the pumps, I stand tall and upright. I slide my lower hand about 1-2 inches higher. When I fail to make these adjustments, I cannot engage the foil. I would end up paddling insanely hard and continue to pull powerful strokes, but the board still just sticks to the water surface. I would get completely exhausted from the sprint paddling. Q: What were some key aspects that simply require muscle memory? Ans : Muscle memory cannot be taught. It could only be learned through endless practice. I found that a few aspects that simply require time on the water: paddling straight and fast, foil pumping, and balance. Footnotes I would also like to do some of these downwind runs . ↩ . The podcast and YouTube discourse misled me about the degree of difficulty for a flat water paddle up. For example, these videos showed how people learn the paddle up ( A and B ). I suspect that these youtubers are semi-professional. They do multiple sessions a week all year. They are already really good at auxiliary disciplines such as sup foiling or dock starting. They probably also have access to all the kits and toys (e.g., boats, e-foil, jet skis, etc) to help them progress. They painted a much rosier picture than what I went through. These videos are more helpful to the true average joes ( A and B ). ↩ In hindsight, I was probably hitting an asymptote. ↩ I have this unverified theory that there is a version of paddle-up approach that is more advanced and efficient. It crucially focuses on pumping as the key source of power generation, and paddling is only auxiliary. I noticed that two instruction videos keened on this fact. Jeremy Rig and this Chinese instructor 紅燒 . They are able to paddle up with far fewer paddles and what appears to be less effort compared to other instructors. For example, Mr. Rig could gain flight in just 5 paddles. Insane! These two videos also lend evidence to the theory that proper pumping is much more important than paddling ( A and B ). ↩","tags":"misc, , foiling","url":"/2024-11-11-flat-water-paddle-up","loc":"/2024-11-11-flat-water-paddle-up"},{"title":"Video Generation Models","text":"Language and vision models are the two frontiers of AI research and product engineering. Language models are mature enough to be many people's first stop for information. The early products powered by LLMs have changed how many search and learn. For example, I have used ChatGPT and Perplexity more than Google Search in the past 3 months. Image models also power useful products. For example, the most popular photo or social medial editing tools incorporate elements powered by image foundation models, e.g. Photoshop Firefly and Midjourney. Chatting and image editing products have already achieved mainstream adoption, demonstrating that they are both futuristic and practical. Video models are still experimental. Since the unveiling of Sora in February this year, video models have been one of the competitive frontiers for top industrial AI labs 1 . Many of these teams put enormous resources and focus on video generation. Numerous companies release their models, some in the form of paid trials and some in the form of just announcements and sample prompt-video pairs. They are still not packaged as useful products yet. The products are just a thin proxy to the underlying models. AI powered video editing is still not available. In this blog post, I am going to use two examples to draw out the basics of what video models look like today. This minimum technical background will allow us to understand how most video generation model is constructed, trained, and used for inference. It would give us the ability to think through its relationship to the its adjacent cousins of language and image models. We could also better understand how video models might evolve in the coming 6-18 months. I picked MovieGen ([polyak2024moviegencastmedia]) and VideoPoet ( KYG+24 ) as the two representative models for a deep dive. MovieGen is a good case study because it is one of the current state-of-the-arts, and the only leading project that releases substantial details about its data, engineering, and design space. Its design choices should be representative of the approaches that everyone else is fast copying. VideoPoet represents the best of open-source models before Sora turned video generation into a completely closed-source field. VideoPoet was also interesting because it crucially does not use DiT. Its technology illustrates alternative design choices that the video model community has more or less abandoned in the last year. Model 1: Movie Gen Movie Gen is a suite of models that provides vision, audio, and editing capabilities. This post only looks at its video foundation model. I aim to explain model as an illustrative example of most of the contemporary video foundation models. I break the model down into a few key ingredients to make it easier to digest: vision encoder, patchify, classifier free guidance, and flow matching objective. I wrote a previous post about how neural networks are used in generative models. This model can be seen as another example within that framework. The encoded videos are then patchified to produce a 1D vector, analogous to a sequence of text tokens in the LLM world. The \"tokens\" are fed into a neural network. The conditional information (i.e. the prompt) is tokenized and fed into the same network. The neural network crunches these inputs, i.e the video and the text, and outputs a video of bounded length. Movie Gen Model The Temporal Autoencoder ( TAE ) TAE is a variation of VQ - VAE even though its input is video data. Let \\(X_0\\) be the pure noise video data. \\(X_T\\) is the uncorrupted video. It has the shape of \\(T' \\times H' \\times W' \\times 3\\) . \\(T'\\) is the temporal dimension, represented the length of the video in number of frames. Each frame is \\(H' \\times W'\\) . The purpose of TAE is to compress the input into something more manageable and computationally feasible, say \\(T \\times H \\times W \\times C\\) , where \\(C\\) is a convenient number, for example \\(16\\) . The encoder first convolutes temporally and then convolute spatially. This reduces the overall dimensionality. Each \"frame\" goes through an attention block, and then temporal attention is applied. At this point, there are \\(T \\times H \\times W \\times C\\) embeddings. Because of the convolution and attention, each embedding contains local and global information in both the temporal and spatial direction. After applying quantization, the video is represented by an element in \\(\\{1, ... K\\}^{T \\times H \\times W \\times C}\\) , where \\(K\\) is the size of the discrete latent space. Both reducing the size of \\(T, H, W\\) and dropping the embedding dimension greatly reduce the dimensionality. See my previous post and vdOVK18 for more details, and these concepts in the image space. The video space adds the time component. It is fundamentally the same, it just adds more accounting. Patchify Patchify is a deterministic operation that re-arranges the input into a 1D vector. A patch could be considered as a sub-video both in the temporal and spatial dimension. For example, one patch represent 5 frames, each frame is only the upper right corder. Say we use patch numbers \\(P_t\\) and \\(P_s\\) . A patch would be a video that has \\(P_t\\) frames with each frame only has \\(P_s \\times P_s\\) pixels. In our latent space, each patch \\(p \\in K^{P_t \\times P_s \\times P_s \\times C}\\) . We treat each patch as a \"token\". That is, each token has a dimensionality of \\(P_t \\times P_s \\times P_s \\times C\\) . At this point, the input video becomes a 1D vector. Because everything is flatten, we add the position embedding to retains its positional information. This vector is analogous to the text tokens in LLMs. This paper DBK+21 first uses this technique. Classifier Free Guidance ( CFG ) Another key component is how to guide the video generation based on some text prompts. It has become standard practice to use the approach of classifier free guidance. The text is tokenized by a pre-trained tokenizer. The tokens are fed through transformers and combined with the video input coming from the patchified vector. The key difference is that the prompt tokens is controlled by a conditioning variable that probabilistic turning on and off during training. There is also a guidance scale that controls how much this input weighs in the inference stage. See HS22 . Flow Matching Objective The neural network is trained with a flow matching objective. The combination of TAE , patchify, CFG , and other minor techniques largely describe how to convert the input (video + text) and feed them into a transformer base neural network. There are two key decisions to make about the neural network. One is the internal architecture and the other is the loss objective. As of the writing of this blog, all the top video models uses DiT, where the network backbone uses transformers and the loss objective is diffusion 2 . The flow matching objective is a generalization of diffusion. This sounds really fancy, but we just have to notes that the neural network crunches all the input as we described, and outputs a generated video \\(X\\) . We calculate the loss, \\begin{equation} \\mathscr{L}(X, t; \\theta) = \\mathbb{E}_{t, X_0, P} ||u_t(X, P, \\theta) - V_t(X|X_0)||^2 \\end{equation} where \\(P\\) is the conditioning information from the text prompt, and \\(\\theta\\) represents all the trainable parameters. It is key that \\(V_t(X|X_0)\\) is directly calculated given a specific probability path description. For the example of gaussian diffusion, \\(p_t(X|X_{t-1}) = \\mathscr{N}(X|\\mu_{t}(X_0), \\sigma_{t}(X_0))\\) . We can get an analytical form \\begin{equation} V_t(X|X_0)= \\frac{\\sigma'_t(X_0)}{\\sigma_t(X_0)}\\left[ X - \\mu_t(X_0) \\right] + \\mu'_t (X_0) \\end{equation} Note that \\(u_t\\) is the output of the neural network. \\(\\mathscr{L}\\) is easily calculated and we could train on data and update \\(\\theta\\) . Inference starts with a video of pure white noise, \\(X_0\\) . It is encoded by TAE . Using \\(u_t(\\theta)\\) and solving a ODE giving us a probability distribution like description in that we could sample \\(X_t\\) from \\(X_{t-1}\\) . For each time step, the noisy \\(X_t\\) get more details through this transformation. The probability path description coming from the flow objective is the same as diffusion. Note that the ODE is not written out here. See my previous post or LCBH+23 for more details. Model 2: VideoPoet I am doing a deep dive into VideoPoet here because this model was state-of-the-art just two years ago. It contrasts the recent models in crucial ways. It was pre- SORA . At the time of its release, video generation was mostly just research. VideoPoet was not DiT. It uses an autoregressive objective instead of diffusion. This contrast is important to keep in mind because even the most recent models still feel immature. More likely than not, we will see a few more breakthrough moments before visual models reach a similar level of maturity compared to language models. Having explained MovieGen, understanding VideoPoet is a lot easier. The model converts text, visual, and audio into a single token vector. Each modality has its dedicated tokenizer. The token sequence is fed into a language model, which is trained to predict the next token. The sampling process autoregressively generates a visual token sequence given any partial input sequence. The model decodes the sequence into a video. Tokenizers The inputs are broken down into text, visual, and sound tokenizer. There are 256 special tokens. These special tokens allow the text, visual, and sound tokens to be delimited. The text tokens are generated by the T5 encoder. MagvitV2 ( YCS+23 , YLG+24 ) is the visual encoder, having a vocab size of 262,144 tokens. The audio tokenizer is SoundsStream ( ZLO+21 ). All the tokens are concatenated into a single sequence. VideoPoet Sequence Layout We focus on a bit more details about its visual encoders MagvitV2. It is similar to VQ - VAE . It is different in two key aspects. First is that it aims for a joint image-video tokenization, and second it uses a lookup-free quantization ( LFQ ) technique to allows the quantization parameter to grow in size. To include temporal information, the encoder primarily use a causal 3D convolution technique. Causal relationship is maintained to ensure that future frame information do not leak. This is important because the encoder are designed to be used for an autoregressive objective. In VQ - VAE , each embedding \\(z \\in \\mathbb{R}^{d}\\) is replaced with \\(z \\in \\{1, ..., K\\}\\) . That is, each embedding is quantized and mapped into a single integer. One critical flaw is that as \\(K\\) increases, the encoder does not increase in performance. MagvitV2 encoder improves on this by introduce another mapping. It first maps an element in \\(\\mathbb{R}^{d}\\) into \\(\\{-1, 1\\}^{d}\\) . For each of the dimension, it just retains its signed information. The values gets summed up geometrically to obtain a single number \\begin{equation} q(z) = \\sum_{i=1} ^{d} 2^{i-1} \\mathbf{1}_{\\{ z_i > 0\\}}. \\end{equation} Autoregressive Objective The model is trained with the objective of predicting the next token. The token sequence uses a shared multimodal vocabulary to represent all the input information (text, audio, and visual). The input is fed through a transformer base language model. The model generates a token sequence in autoregressively. This resembles a prefix, decoder only language model almost exactly. The key difference is the token vocabulary. Design Space of Video Models Now that we have taken a close look at two of the representative video models. We are equipped with enough background knowledge to discuss what all these AI labs are experimenting and competing on. DiT and Architecture The most striking fact is that everyone settles on DiT. DiT means that the neural network uses transformers backbone and is trained on a diffusion objective. Before this year, U-Net base on CNNs has the architecture of choice of most image and video diffusion models. The primary reason is that transformers requires quadratic memory relative to input sequence length. But the work of scaling transformers benefit greatly from the research results in language modeling. The memory requirement issue is addressed through more compute resources, more efficient attention mechanisms (e.g. flash attention or windowed attention), and better compressions through encoders. Intuitively, transformers have proven to such a breakthrough architecture in the world LLM , it only makes sense that it should be a key ingredient for visual models, especially as these models scale up. The choice of diffusion over autoregressive objective is more peculiar. Autoregressive modeling is highly successful in language, but diffusion works better for visual. There are a few things I would like to observe. First, language is intuitive causal. Language is an artificial way to encode information that works well for human. We speak one word after another. Whether intentional or not, autoregressive language model captures this behavior well. However, a visual image is not inherently causal. All the pixels kind of just simultaneously exist. The process of breaking an image down into different sub-images and then piecing them together does not really make sense. I would think about how to realize an image if I were to open my eyes from a deep sleep. First, I just notice a vague outline of the world, and then more details fill in. I start to make sense of the key details that are relevant orient myself in the physical world. My brain does not create visual token after visual token to realize the full image. Furthermore, I actually do not notice all the details, but yet I have a global sense of the what is in front of me. This resembles the diffusion process. In some ways, both autoregression and diffusion are about starting from nothing to building and building more details to complete a bigger objective. In the case of language, it is word after word. In the case of visualization, it is overall details. This blog also makes the connection between autoregression and diffusion. It breaks visual content into a spectral space, and shows that each diffusion step, it is adding more details in different spectra. That is, next token is autoregressive on tokens, and diffusion is autoregressive on spectra. It is interesting to notice that video does not benefit from a causal rendering that similar to language. For each of the frame or adjacent frame, I suspect it is diffusion is a superior generating process compared to causal rendering like autoregressive. But between scenes, I kind of suspect that some kind of causal objective is more appropriate. Is there a way build a video model that allows visual rendering to be mostly coming from diffusion but temporal content to be driven by some kind of causal objective. Someone would have to make this breakthrough. We might see this in the future. I don't think diffusion objective is sufficient for generating high fidelity, longer form videos. Encoders VQ - VAE was a key development demonstrating the power of learned visual representation. It is not surprising that tokenizing text is a more or less a solved problem because text itself is already a well formed abstraction of the physical world. For the example of Byte Pair Encoding, it merely just learns a fixed number, say 30,000, vocabulary from a corpus. That is, its algorithm essentially picks 30,000 subwords. Any new text will be converted into those subwords. On the other hand, visual data is much more raw. An image is composed of pixels, which could be roughly presented by 3 small integer. There just isn't any natural way to turn them into discrete variables. In order to convert visual information into low dimensional, discrete chunk, the encoder has to be highly intelligent. That means the conversion process itself must be powered by a large neural network. Every video model actively experiments with the encoder design. The key characteristics is compression rate and representation power. Compression is important because video input consumes a lot of memory. Even a 10 second with a frame rate of 16 fps is too much to feed directly into a DiT network. Representation power could be understood as reconstruction loss. There is a natural tradeoff between compression and representation power. This is a classic problem in information theory. In the ideal world where we have limitless computing power and memory, we would not need to throw away information through lossy encoding. But visual information is still too much to be handled directly in its entirety by large deep network. Encoder design is likely to be an unsolved problem. Encoders architecture choices are relatively well known. Visual data are usually convoluted in the spatial and temporal dimension to reduce dimensionality. They are further fed into some attention system. We can also add any other neural network structure to the embedding afterward. Dimensionality could be further reduced through some of quantization. These architectural choices are somewhat similar to feature engineering. For each of these components, they have learnable parameters. The vast majority of the intelligence of the encoders is learned from data. Most of the encoders are still relatively small compared to the main diffusion component. The choice of encoders still has a large impacts on overall model performance. I have not seen anyone really pushes scaling boundary of encoders. Encoders are still less than 1 billion parameters. It might be worthwhile to build a 10 to 50 billion model just for encoding and decoding. The key reason is that the most advanced video models are only in the 10-50 billion parameters range. Visual models haven't demonstrated the same commercial value compared to language models. Visual models are still early in the product development cycle. Scaling Estimate The state of the art video models are about 50 billion parameters. GPT4 was already reported in the trillion-parameters range, and that was a year ago. The most advanced LLMs are probably in the 10 trillion range. I take the position that visual system is more complex than language system. A video communicates more information about the physical or abstract concept than language model. Furthermore, if a visual system also generates audio, it is without a doubt that visual system is strictly more capable than language system. Visual model is more computationally demanding than LLMs. Even a relatively low quality, encoded, 60 seconds video will be 20 MB . An averaged news article is about 1000 words. 1 MB is probably enough to encode 50 articles. One 60 seconds is equivalent to a thousand articles. It is not possible apply self attention naive to a video input. Video model requires a lot more clever techniques to reduce its computation complexity, but it is also fundamentally much more constrained compared to language model. The two facts that visual model is fundamentally more complex and it is 2 orders of magnitude smaller than language model do not add up. It means only one thing. Visual system is no where near in maturity compared to LLMs. If we rate LLMs as having the reasoning ability of an undergraduate, visual model is more like elementary school. Even if the generated models might look impressive, its internal model of the world is not nearly as sophisticated. Vision Data Movie Gen uses billions of images and 100s of millions of video clips. Let's be more concrete. LAION 2B dataset is roughly 500TB . Merlot reserve is about 500TB of data and could be roughly cut into 500 million clips. Just as a point of comparison, gpt4 is reported trained on roughly 15 billion tokens, which is roughly 50 TB of text data. I am sure that Movie Gen uses some of these datasets in additional to other datasets. The combination of LAION 2B and Merlot Reserve are quite representative and in the right order of magnitude in terms of training data for the best of the video foundation models today. I have written my previous post of open source vision dataset that LAION and CommonPool dataset probably already reach the upper limit of what image data the internet can provides. We have much more video data than the typical billion clips range. We should be able to see much better video models in the coming years even just on the expectation that there is a lot of room to grow on video training data. We could even estimate what is required to create a clean, training-ready video dataset that is in the 10 billion clip range. Say the data team is given 10,000 GPU . It still takes 10 days to just do one pass on the data that just do a simple encoding/decoding. It might sound hard, but with good enough engineering, this compute infrastructure should be enough. While the GPU could be amortized over the years to do ther work, roughly speaking, it is still a 50 million if we assume GPU unit cost of $500. Other infra cost includes data storage, CPU server, and data transfer. Data acquisition will likely incur direct payments in the million range, if not 10s of millions. We can say we have a 10 person engineering team; while expensive, it is not as much as infra. I would estimate the total cost of that dataset in the range of 100 million dollars. That is expensive! But a lot of teams will pay that kind of cost to develop these models. It should be noted that this cost is just for creating one next-generation video dataset. Training will cost much more. Judging the details different teams have release, it seems that youtube videos are still make up of the majority of the training data. All the research projects are working hard to get alternative sources. Their focus is likely to be dependent on what they want their models to focus on. Video datasets are more challenging to work with compared to text and even images. They are much bigger. They are not just expensive but also much more demanding on the engineering system to do data transfer and process. They have to be encoded at rest because raw frames are simply too large to store. They require GPUs to to speed up even basic processing such scene cutting, filtering, and captioning. A lot of video data are better protected and less generally available on the internet. Even if some research teams want to be loosey-goosey with compliance, it is still hard for them to get their hands on the vast majority of content from YouTube, movie archives, sports recordings, broadcast studios, and game replays. Texts and images datasets are much easier to create, and at this point, almost all the major industry AI research labs probably end up with more or less the same datasets. It will be interesting to see how video datasets play out in the next few years. Models Comparison I compiled key metrics about some well-known models 3 . Model Announced Org Openness Product Trial Imagen Video 2022-10 Google paper no Runway Gen-2 2023-04 Runway closed paid AnimateDiff 2023-06 Tsinghua open public Lumiere 2023-08 Google paper invite only ModelScope 2023-08 Alibaba open public Stable Video 2023-11 Stability AI paper yes Emu Video 2023-11 Meta paper no VideoPoet 2023-12 Google paper no WALT 2023-12 Google paper no Sora 2024-02 OpenAI technical report no Veo 2024-05 Google closed invite only Runway Gen-3 2024-06 Runway closed paid Open-Sora 2024-06 HPCAI Tech open public Kling 2024-06 Kuaishou closed paid PixelDance 2024-09 Bytedance closed invite only Seaweed 2024-09 Bytedance closed invite only CogVideo 2024-09 Tsinghua open public Movie Gen 2024-10 Meta paper no Product Attributes Model Size Data Architecture Training Objective Imagen Video 5.7 billion 60M image 14M video U-Net diffusion Runway Gen-2 unknown unknown unknown unknown AnimateDiff 1 billion 10M video adapters diffusion Lumiere unknown 30M video U-Net diffusion ModelScope 1.7 billion 2B image 10M video U-Net diffusion Stable Video 1.5 billion 600M video transformer diffusion Emu Video 4.4 billion 34M video U-Net image diffusion next frame VideoPoet 8 billion 1B image 270M vidoes MAGVIT -v2 transformer next token WALT 0.5 billion 970M image 89M video MAGVIT -v2 transformer diffusion Sora unknown unkonwn transformer diffusion Veo unknown unknown unknown unknown Runway Gen-3 unknown unknown unknown unknown Open-Sora 1.5 billion 90M videos VAE transformer diffusion Kling unknown unknown unknown unknown PixelDance unknown unknown unknown unknown Seaweed unknown unknown unknown unknown CogVideo 5 billion 2B image 35M video VAE expert transformer diffusion Movie Gen 30 billion O(1)B image O(100)M video TAE transformer diffusion Model Attributes The announcement of Sora was a watershed moment. Before Sora, video models were driven mostly by the research community. When Sora was made public in February, it only released minimal technical details and a limited set of selected prompt-video pairs. Still, it showed the world its enormous commercial potential. Sora set the trend of being closed source. It was released with limited details, and all subsequent models from industrial labs followed suit, releasing even fewer technical details. For example, there is no public information about the internals of video models such as Kling, Bytedance Seaweed, or Google's Veo, which are considered competitive with Sora and MovieGen. The only exception is Meta's MovieGen foundation model. We have to give the Meta team their flowers. Despite limited public information, we can summarize some major trends of all the cutting-edge models in terms of design choices, model scale, and data. DiT (transformer backbone + diffusion objective) has become dominant after Sora's technical report. The model scale ranges between 5 to 50 billion parameters. The image data focuses heavily on internet datasets that more or less resembles DataComp and iis in the order of billions of text-image pairs. The video dataset focuses on YouTube and movie clips and ranges between 100 million to 1 billion of 5-20 second clips. These characteristics are typical in all the latest models 4 . Given the similarity in architecture, data, and model scale, the models' similar capabilities are not surprising. Different models have different focuses. For example, some focus on generating movie-quality clips, demonstrating close-up shots, rack focusing, zoom shots, etc. Some focus on consistency in scene changes, and some on motions. In my opinion, all the output videos feel similar. Pre-Sora generated videos look a certain way. They are mostly animated pictures with a low frame rate. The post-Sora videos are similarly recognizable. They are 5-10 seconds long and feel sci-fi and staged. It could not easily generate realistic, real-life clips similar to what I could capture with my phone. Each model tends to do well in some areas but not all. To me, this is evidence that video models are still early in their development cycle. There are still a lot of low-hanging fruits. For each deficient area from one model, other models have already shown it is feasible to overcome that deficiency. Each model focuses on a particular set of data, but no single project team is an order of magnitude better than others. The same could be said about model scale. I would say that if a team possesses sufficient resources, comparable to those used in training GPT4 (in terms of hardware, data, and engineering talent), they should be able to build a model that is significantly superior to the current generation. The latest video models are mostly closed to the public. I suspect a few reasons. First, inference cost is high. The models are built as research models that are not optimized for production to serve tens or hundreds of millions of users. I tested running a CogVideo 5B model on an 80G A100 GPU in AWS . To generate a reasonable video from a prompt, it took about 3-5 minutes. The cost is about $0.1 per video. That is assuming full GPU utilization over the course I rent the GPU . In practice, the cost per generation is much higher 5 . Inference will likely require additional optimization both at the model and distributed system level to push inference time from minutes to seconds to provide a reasonable user experience. Second, the models still produce a lot of unrealistic videos. The released prompt-videos pairs are heavily filtered to show their potential. Third, productization is hard. Each AI lab has gotten enormous resources to speed up their foundation model development. Optimizing models for millions of users takes time. Turning foundation models into product features, such as video editing, or multimedia content-creating tools, takes a different set of people and skillsets. High-quality user experiences will take time to emerge. Footnotes I would say that the two frontiers in the year of 2024 have been LLMs and video generation. Industrial AI labs are AI teams within large tech corporations or extremely well-funded startup companies. Examples are Google, Facebook, Bytedance, OpenAI, and Anthropic. ↩ Alternative to transformer is CNN or LSTM . Alternatives to diffusion is an autoregressive objective or direct density estimation. ↩ An update on 2024-12. This field is moving very fast. In just two months, my tables are already outdated. There are more models to be considered: Pika, Luma, Hunyuan, Haiper, Sora with public access, and Veo 2. Even though some people have asked me to update, I probably won't. The goal of my notes is only to document a snapshot of my thinking. ↩ . All the major teams are secretive about their effort. But small tidbits of information leaked in various public reportings. ↩ Overall utilization is going to be much lower than 80-90%. Ten cents is just the GPU cost. Each interaction with the video model feels a lot like posting an Ethereum transaction. A service is not going to be able to make money from a $20 per month subscription if the product has high engagement. ↩","tags":"misc, , ai","url":"/2024-11-02-video-models","loc":"/2024-11-02-video-models"},{"title":"Analytics for Binary Blobs","text":"One of the most underrated changes that the recent AI models made possible is data transformation for unstructured data. For the purpose of our discussion, we loosely say that structured data is tabular data, and unstructured data is binary blobs. For example, free form texts, images, and audios are unstructured. Before the recent transformer base LLMs and vision-language models, extracting useful information from unstructured data requires purposely engineered pipelines. With the power of the latest model, many data points could be extracted by selecting a model and setting a prompt. For example, say we want to get the key financial metrics (e.g. profit margin, revenue growth, and debt ratio) from annual reports. In the past, this is a full pipeline that is custom built and validated. It is not something that could be done in a few hours.It is now conceivable that this is just a SQL query that only requires a few minutes to write. Let do a short short review of analytics databases. I had an old post on this topic that discussed the realtime and batch mode divide, and at the time, I implicitly ignored unstructured data because established analytics solution do not handle binary blobs. In this note, we focus on structured vs. unstructured data. To simplify discussion, I also ignore analytics solutions tailored to transactions, caching, and streaming processing, e.g. mysql, redis, kafka. For structured data, the most popular options are Hive, Iceberg, Delta Lake, Snowflake. For unstructured data, they are almost always stored in object store as S3, Google Cloud Storage, or Azure Object Store. Analytics databases are designed to store and process large amount of structured data. Data is a table where each column has a defined type such as text, number, or boolean. Each table scales to trillions of rows or beyond. A single table could have petabytes or data 1 . The query pattern is something logically equivalent to SQL statements. Query execution is equivalent to batch processing. The most popular query engines are MapReduce, Tez, SparkSQL, Presto, Flink, and Ray. Most databases support a variety of these engines. For example, Hive, Iceberg, and Delta Lake all support Presto, Flink, and Spark. In many ways, these solutions are similar 2 . I will use Iceberg as an example to illustrate the key components a data analytics solution. A workin Iceberg deployment has three components: data, metadata, and query engine. Data are stored as parquet files in an distributed file system or an object store. A table's metadata layer is managed by metadata files, which manage manifest files, which manage manifest lists, which manage data files. All four of these file types (metadata files, manifest files, manifest list, and parquet files) contain some sort of hints, e.g. partitions, column information, value ranges or bloom filters, that are use for skipping data 3 . Query execution first filters data and then apply some compute actions. They are translated to a distributed compute engine such Spark and Presto. Iceberg Table Internals Other solutions such as Snowflake, Delta Lake, Hive are similar in broad strokes. For example, the key difference between Hive and Iceberg is how metadata is organized. Hive's metadata is much simpler than Iceberg. A Hive table metadata only keep tracks of what partition is registered. Each partition corresponds to a S3 dir. Iceberg has much more design around metadata. Object stores do not offer analytics capabilities. Object stores could be viewed as key-value store or a distributed file system. When it is viewed as a key-value store, we should not be confused it as a low latency distributed key-value store in the vein of dynamodb, aerospike, or distributed Redis. Object store is not meant to be used for read/write 10 bytes object millions of times per second. Another way to look at an object store is that it is a horizontally scaled out file system. In fact, most of the previously mentioned analytics database use a object store to as the storage backend for data files. Object store is extremely simple yet powerful. It is a foundational component of almost all modern data platforms. However, in order to get useful information out of billion or trillion of binary blobs stored in object store, we usually have to build custom data pipelines to extract information. In the past, we only bother to run queries on structured metadata because it is expensive to extract useful information from unstructured data. It requires a lot of validation to get each extraction to be sufficiently high fidelity. With VLM and LLM , each new question is literally just another question written in plain English. Working with unstructured data is no longer restricted to teams with expensive engineering talents. Anyone should be empowered to ask meaningful questions and be given the ability to execute those queries. Anyone who is willing to execute SQL had been able to get meaning out of structured data. The data transformation from unstructured from structured should be just as easy 4 . The most common solution to get analytics from unstructured data is to write one-off custom batch jobs. The batch jobs are written as distributed data processing programs, sometimes run directly on VMs, on top of kubernetes, or integrated with distribute compute frame work like Spark or Ray. The job create new structured data, which could combined with existing tables to generate reports or new datasets to be used for training or other data science needs. For example, a team has a trillion of binary objects, e.g. texts or images. The team continually adds new data points because the team has new documents. The team continually extracts information, e.g. new questions are asked about the text documents or the images. The newly extracted metadata is combined with existing metadata, all expressed in SQL , to generate new reports or training datasets. A reasonable solution is to use Iceberg to store metadata and S3 to store binary blobs. When new data comes in, they go through a data processing pipeline, which is its own independent system. The processing pipelines generate metadata and ingest into Iceberg. To ask new questions about the documents, we create new processing pipelines to generate additional metadata to be merged or append to IceBerg. We run SQL queries in Iceberg to generate reports or training datasets 5 . This kind of system is functional but not ideal. The processing pipeline needs be built, maintained, and execute. It would require dedicated engineering know-how. It also slows down iteration. Multiple components have to be executed and monitored separately. Data users such as business analysts, data scientists, AI researchers who are not also pipeline builders could learn to use the system by ways of better documentations or trainings. But in practice, it invariably requires cycles by someone with deep knowledge of the engineering pipelines. Snowflake is the one of the few solutions that attempts to give out-of-the-box analytics capability to unstructured data . It allows users to write User Defined Functions to process binary objects. However, it still feels that its unstructured data processing capability is an independent feature from its core SQL analytics. It does not integrate key-value object references directly with its tabular table. UDF is defined in Java. It is hard to translate AI models into SQL functions. I would want to see an analytics solution that integrates tabular metadata with arbitrary binary blobs. The user experience goes something like the following. Data storage is table that supports both structured data and binary blobs. For example, we have a table of text documents and images. We have 10 columns of metadata for each object. There is a column that references its binary data as a S3 key. Users want to create a new datasets that has 3 columns from the table and two additional columns extracted from the binary blobs. They could write a statement similar to CREATE FUNCTION are_people_surfing ( file BINARY ) RETURNS BOOLEAN ... CREATE FUNCTION is_there_kid_in_the_image ( file BINARY ) RETURNS BOOLEAN ... SELECT col1 , col2 , col3 , are_people_surfing ( img_col ), is_there_kid_in_the_image ( img_col ) FROM table It might looks like it is only an incremental change from existing analytics solutions such as Snowflake and Iceberg. The key components are: an unified view for both structured and unstructured, enabling analytics of binary blobs at query time, and integrating LLM and VLM models in SQL queries. An out-of-the-box data analytics solution with these features would reduce the engineering overhead for many data engineering teams 6 . Furthermore, it will empower many data users to directly work with data by asking plain English questions. I have personally seen many different teams have this need and devoted sizable engineering resources to fulfill that need. These teams have a lot of unstructured data would are cumbersome to use with existing out-of-the-box solutions. Footnotes In theory, it could go up to exabyte, but I have not personally use these solutions at that scale. ↩ I don't really buy the marketing hype of lakehouse. Hive was the standard open source data warehouse solution for many years. Marketing folks coined the term lakehouse to differentiate the new comers (e.g. Iceberg, Delta Lake, and Hudi) from existing data analytics solution, which used to be called data warehouse. To me, Iceberg is a straight-forward next iteration of Hive, and it is solving mostly the same set of problems with additional optimizations. It is silly to create a new category because the problem statements have not changed. ↩ A query is able to first take advantage of the various metadata available to filter out unnecessary data files to use. Finding files is just working through the metadata files, it is done locally and does not involve a distributed compute engine; this is known as scan planning . For example, manifest list contains value ranges, manifest files has column level stats, and parquet file has row groups, each row group has column level stats. ↩ Furthermore, it is becoming more and more robust that we could create SQL queries just by prompting. The next iteration is clearly that using plain English, we could create datasets or reports by just using English for both structured and unstructured data. ↩ The most popular framework to build these processing pipeline is Ray, especially if GPU -heavy models are used frequently. It could be done with Spark. It is also not uncommon to write distributed jobs from scratch combining with simple job queues backed by Redis or MySQL. ↩ Let us sketch out how to build this solution with popular open source components as of the writing of this note. We pick S3, IceBerg, and Ray to be the backbones. Iceberg is the metadata store. S3 could be viewed as an internal component of Iceberg. Ray is the compute engine. The main reason to choose Ray here is that we want to be able to integrate AI models and SQL executions. As of the writing of this blog, Ray dataset does not provide a SQL API for writing data transformations. We need to build three major components to get to a working version: translating SQL statements into Ray Dataset execution and connecting models to user defined SQL functions. ↩","tags":"misc, , ai","url":"/2024-10-28-binary-storage-engine","loc":"/2024-10-28-binary-storage-engine"},{"title":"Open Source Vision Datasets","text":"In the past year I have worked with a few projects that use vision datasets. People might give them fancy names, but in simple terms, they are just datasets with images and videos that are annotated with text descriptions. For the purpose of training large models, projects almost always start with open source datasets. They are the most accessible, and more likely than not, they will make a large contribution to the full training datasets. There are two common additional sources, commerical (e.g. iStock, Shutterstock, Leixa , Alamy , Artistation , Adobe Stock) and internal datasets (e.g. JFT -300M, JFT -3B, IG -1B). I am compiling most of the well known open source datasets into a few tables to showcase how vision datasets have evolve over time. Image Dataset before ImageNet Name Released Size Source Org Notes FERET 2000 14k curated NIST faces caltech-101 2003 9k curated Caltech objects Yale Face B+ 2005 16k curated Yale faces Caltech-256 2007 30k curated Caltech objects Oxford-5k 2007 5k curated Oxford buildings LFW 2007 12k web U Mass. Labeled Faces in the Wild PASCAL VOC 2007 30k curated Microsoft objects MIRFLICKR -25K 2008 25k flickr Leiden objects, concepts Tiny Image 2008 80m web MIT classification NUS - WIDE 2009 270k flickr NUS classification SUN 2009 130k web MIT environmental scenes, places and the objects ImageNet 2009 14m web Princeton classification Table 1: Early Image Datasets ImageNet kick started the revolution of training large, deep neural net model using large amount of image. While image datasets existed before ImageNet, they were much smaller in scale. They tend to cover a narrow, specific purpose. They all involve human curation and annotations. ImageNet's key differentiator was its size. It was considerably larger than previous dataset. Its labels were still annotated by humans, maintaining its accuracy. There was a curious case of TinyImages because it also had scale, but it did not have nearly the impact that ImageNet had. The key differentiator was that ImageNet retained the image's full resolution, it used human annotation, and it hosted a very popular benchmark competition. Image Datasets Name Released Size Source Org Notes TinyImages 2008 80m web MIT ImageNet 2009 14m web Princeton SBU caption 2011 1m flickr Stony Brook flicker 30k 2014 31k flickr Urbana-Champaign yfcc100m 2015 100m flickr Yahoo open images 2016 9m web Google JFT300m 2017 300m Google not public cc3m 2018 3m web Google Conceptual Captions IG -1B 2019 1b Instagram Meta not public LAION 400m 2021 400m common crawl LAION cc12m 2021 12m web Google redcap 2021 12m reddit U Michigan WIT 2021 11m wikipedia Google WebImageText 2021 400m web OpenAI - not released - CLIP ALIGN 2021 1.8b web Google - not released - ALIGN JFT3B 2021 3b Google not public LAION 5B 2022 5b common crawl LAION Coyo 700m 2022 700m common crawl Kakao Brain CommonPool 2023 12.8b common crawl LAION DataComp Table 2: Web Datasets Name Released Size Source Org Notes CIFAR -10 2009 60k TinyImages CIFAR classification ImageNet 2009 14m web Princeton classification MS - COCO 2014 300k web Microsfot - segmentation - context - multiple captions Visual Genome 2015 100k ms-coco yfcc100m Stanford - relationships - bounding boxes Open Images 2016 9m web Google - relationships - segmentation - labels Hierarchical Paragraphs 2017 15k visual genome Stanford - Hierarchical Approach for Generating Descriptive Image Paragraphs - dense paragraph Localized Narratives 2020 900k ms-coco, open images Google - voice and mouse movement Hateful Memes 2020 10k Getty Meta Crossmodal-3600 2022 3.6k curated Google 36 languages Segment Anything 2023 11m - licenced Meta - segmentation DCI 2023 7.8k segment anything Meta - Densely Caption Images ( DCI ) - multiple rounds of human annotations vision2ui 2024 3m web Peking ui images DOCCI 2024 15k curated Google - highly detailed description - donated by one person IIW 2024 9k curated Deepmind - ImageInWords - highly detailed - sequentially refined Table 3: Curated Datasets Name Released Size Source Org Notes clevr 2017 100k generated images stanford, meta - visual question answering LAION -coco 2022 600m LAION 2b laion - recaptioning LAION -translate 2022 3b LAION 5b laion - machine translation COYO -Labeled-300M 2022 300m coyo kakaobrain - machine labels - ImageNet labels Pixel-Prose 2023 16m - LAION - cc12m - redcap u maryland - dense caption - using Google Gemini Pick-a-Pic 2023 500k generated images stability ai - user prompts - generated images DAC 2023 3m cc3m IBM - Dense and Aligned Captions - LLM expansion - segmented parts Recap-Datacomp-1B 2024 1.3b common-pool santa cruz - recap with LLaVA Table 4: Examples Synthetic Datasets Image dataset scales from 10's million from ImageNet dataset to 10's billions in recent years. The model and training dataset size increases exponentially because of transformers-based deep learning models. Companies such as Facebook and Google first experimented with internal datasets that are in the 100s million to billions ranges (e.g. IG -1B and JFT -300m). It took a few years for the open source community to catch up. LAION adopted similar automatic scraping and filtering techniques to curate LAION -400m, and later on with LAION 2B, and Common-Pool. These images became the backbone of data sources for all image generators and multimodal models. The latest dataset is the 12B dataset Common-Pool. We are probably reaching the limit through scraping the common crawl archive. It is possible that we get to 100 billion, but in my opinion, that is the upper bound of the public domain. There are other data sources that could break this scaling limit. Social media platform such as Instagram has 100s of billion of posts. The total image counts in personal photo archive (e.g. Google photos, Apple photos) should be well into the trillion level, and they contain much higher quality photos. For example, DOCCI is an example where one researcher is able to curate more than 10,000 high quality images in a few years. IoT devices could be a rich source of image data as well. The quality and the diversity of the images in these datasets resembles the content of LAION -400m. This is evidenced by how similar generated images look and feel across different the latest models and products. The content of the datasets such as yfcc100m, LAION -2B, Common-Pool, and Coyo are remarkably similar. One could extract similar subsets from those datasets. For example, if one were to extract a subset of images that filters for humans, aesthetic, size, aspect ratio, and text-image alignment, the filtered subset have very similar images. I suspect that is just the nature of images in public digital domain. It is a reflection of the evolution of digital media. More likely than not, the Google and Facebook private datasets look similar to the open source, web-scraped versions. The text descriptions are still in the middle of active development. We have a few varieties. The first generation of text is merely for classification. They are objects and concetps. Examples are TinyImages, ImageNet, and SUN . The second variety is description scraped from the alt attribute in html img tag. This description tends to be short, noisy, and could contain irrelevant texts. The third variety is typified by DCI , IIW , and DOCCI . In the last couple of years, it is all about vision-language models. That means the text is desired to be general and contain as much details as possible. The descriptions in DCI , IIW , and DOCCI are human annotated through multiple rounds. They are impossibly detailed. They are purposely and painstakingly annotated to include a lot of details. They are time consuming to create. These datasets are rare and small, in the 10k range 1 . Fourth type is recaptioning. This is a big trend. Most of the synthetic data focuses on recaptioning. A lot of the image, video, and multimodal model training dataset choose to use recaptioned descriptions over raw texts 2 . Video Datasets Name Released Size Source Org Notes ucf-101 2012 13k clips youtube U of Central Florida classification activity-net 2015 30k videos youtube KAUST classification kinetics 600 2018 500k clips youtube Deepmind classification ssv2 2018 220k clips crowd sourced Qualcomm - action recognition - video understanding how-to-100m 2019 1.2m videos, 136m clips youtube Deepmind - instruction videos movienet 2020 1000 movies CUHK WebVid-10M 2021 10m clips Shutterstock Oxford - low resolution - poor visual - watermarks merlot 2021 6m videos, 180 m clips youtube Allen Institute for AI - aim to be generic, diverse - YT -Temporal-180m celebv-hq 2022 15k videos youtube Nanyang Technological University - focus on humans hd-vila-100m 2022 3.3m, 100m youtube Microsoft - aim to be diverse VidChapters-7M 2023 817k videos, 7m chapters youtube Meta HD - VG -130M 2024 1.5m, 130m youtube Microsoft panda 70m 2024 3.3m, 70m HD - VILA -100M Snap - recap merlot reserve 2024 20m youtube Allen Institute for AI YT -Temporal-1B Table 4: Examples Synthetic Datasets Almost all of the open source video dataset use Youtube as the data source. All the significant datasets are of this variety. Their creation process is simple. They scrape youtube ids based on some criteria. For example, celebv-hq focuses on celebrity. How-to-100m focuses how-to videos. HD -Vila and Merlot datasets attempt to cover diverse topics. These youtube videos are cut to 5 to 20 seconds short clips, usually cut on the scene changing boundaries. Different dataset might apply additional filtering, processing of metadata, and recaptioning. Youtube is the singularly important data source for video dataset. If Google were to pursue serious law suits against major AI model developers and open source researchers in the near future, it could impact many research projects and video AI products. There are alternative sources of videos, such as movies, broadcast studio archives, short video platforms, and alternative video platforms. Still, youtube is king. In theory, video datasets should be orders of magnitude larger than image datasets. In practice, open source video datasets are comparable to image dataset, or slightly smaller. Each youtube video is roughly 30MB . The largest video dataset is about 500TB (Merlot Reserve). This comparable to LAION 2B. This is partly due to the compute resources those dataset creation teams have access to. Large industrial labs such as OpenAI are certainly working to create datasets an order of magnitude larger than these. It also indicates that video dataset is still in its early stages. Unlike image datasets, I expect video datasets to dramatically increase its scale in the coming years, if not the coming months. Video models should still have a lot of room to grow. It is exciting to see what kind of video editing will be enabled by the next generation of video models in the next few years. Footnotes It would be interesting to see a team to devote a lot of human resources to create a 500k dataset with such detailed description and use that to train a captioning model. That is one experimental design that no one has attempted to push the limit of language-vision models. ↩ Model interdependence is a fascinating phenomenon that is not widely discussed yet. Most generated outputs look remarkably similar. Their architectures are similar. Their training data are similar. They also use a lot of the same underlying models to process data or create synthetic data. Recaptioning is the most common such examples. From a information theoretical standpoint, the models all contain knowledge. It is not surprising their results are similar. It is almost tempting that they will invariably converge as models get larger and become more capable, unless they contain different starting information. ↩","tags":"misc, , ai","url":"/2024-08-01-vision-dataset","loc":"/2024-08-01-vision-dataset"},{"title":"The Role of Neural Networks in Generative Models","text":"Generative AI is an immensely popular topic. We have seen many new models coming out the last few years. These models generate impossibly high quality samples in almost all digital media: text, images, speech, and music. This blog post takes a look at how some of these models are formulated. I focus on making it obvious how neural networks are used as the key technique to approximate the most intractable components. My goal is to demystify these generative models, and empower distributed system engineers to dig deeper and become comfortable contributing to writing high performance codes for inference and training of AI models. Neural Networks as Approximations A neural network is a parametrized function. A linear regression is a parametrized function. A neuralnet is a complicated version of that. The act of training is to optimize the parameters based on data. A modern deep neural network is the latest iteration in numerical techniques on how we could approximate extremely complex, high dimension real-world functions. A generative model is the easiest to be understood if we start writing down its inputs and outputs. For example, a text-to-image model takes text as input and output an image. The current state of the art model usually describes a series of interpretable transformations 1 . Some of these transformations are easy to program, but some have to be approximated. The approximations are done by neural networks, where their parameters are learned from data. Let's take diffusion image generation as an example. We can program the forward diffusion process. The starting image is \\(x_0\\) . From \\(x_{t-1}\\) to \\(x_{t}\\) , we add gaussian noise to each pixel at each time step. Image generation is the reversed process, where we start with the pure white noise and denoise the image step by step. It should be clear that it is not possible to just write down a formula and program the reverse process. However, the reverse process exists. We take a set of images, \\(\\{x_0 \\}_i^n\\) , we run the forward process, we would be able to get a set of dynamic process \\(\\{x_t \\}_{t=0}^{T}\\) . There exists a time-dependent probability transition function that describes the reversed process. That it, we should be able to sample \\(x_{t-1}\\) given \\(x_t\\) from \\(p(x_{t-1}|x_t)\\) . We represent this conditional probability as a parametrized neural network \\(p_\\theta(x_{t-1}|x_t)\\) , where \\(\\theta\\) is the parameters. At this point, the question is about how to find the optimized parameters of the neural network. At the core of most generative models is a high dimensional probability distribution. Instead of working directly with text, image, sound, or video, we would like have a mechanism to convert those media into a more convenient encoded space. This conversion step is usually learned from data. There is a decoder that are built jointly with the encoder. The algorithm to calculate or train the encoder-decoder system is not compute heavy relative to the approximation step of learning the sampling probability distribution. Much of the complexity of modeling is deciding which probability distribution to approximate. The approximation must be constructed in such a way that it could be efficiently learned from data, and the approximation is able to generalize well in the desired domain. Generated data are sampled from the learned probability distributions. The sampled data is then decoded to the desired media format. It is worth noting that neural network is not the only way to approximate a high dimensional function. In one extreme, we know that linear methods are way too simple to be useful. In another extreme, it is not like we could simulate the world at the quantum level to observe macroscopic behaviors. There are previously many different techniques used to estimate these density functions, such as MCMC , dimensionality reduction techniques, kernel density, bayesian methods, etc. However, they do not perform well enough to support the current generative models. The deep neural network approach enables a scale of learning and capability that is orders of magnitude more performant than previous methods. Examples For each of these generative models, my aim is to succinctly describe two parts. The first part is what the neural networks represent. The second part is how to train those networks. The first part is usually very simple to use in practice, but almost always hard to put into words about its exact meaning. It is simple because we could just treat those trained neural networks as blackbox functions. We only need to understand the inputs and outputs. They are simple mathematical objects. In fact, they are almost always organized as high dimension tensors. They sometimes represent things we can easily correlate to physical objects, such as a \\(3 \\times H \\times W\\) tensors, would represent an image. However, some of these functions would have inputs outputs that are less easy to be described in words. If we suspend our curiosity for interpretability, it is not hard to understand that a generative model is nothing but a series of transformations. The second part is about how to learn. Training a neural network is about updating their parameters. Samples are fed into the model, a loss is calculated, and the loss value provides guidance on how to update parameters. This process repeats itself for each batch of data. The tricky part is to explain the rationale behind each model's unique choice of loss objective and what it is estimating. I will not go into too much details on those derivations. Instead, I will put on the engineering hat and just look at these loss objectives as they are written out. I want to describe in as little details as possible, but enough so that we could program these training steps. The goal here is to demystify these models to the extend that if we were to asked to rewrite both the training and inference components, we should be able to figure out the exact computations and be armed with sufficient theories to start writing high performing programs to perform the computations. Below is a summary of the models to be discussed. Model Trained Neural Networks Sampling Process VQ - VAE - codebook embedding \\(e_{\\theta}\\) - encoder \\(E_{\\theta}\\) - decoder \\(D_{\\theta}\\) - priors \\(p_\\theta\\) - sample latent codes from \\(p_\\theta\\) - feed the code to decoder Diffusion via Score Matching - estimate \\(\\epsilon_\\theta\\) - \\(\\epsilon_\\theta\\) solves for \\(\\mu_{\\theta}\\) , which solves \\(p_\\theta\\) - \\(p_\\theta\\) governs the probability transition from \\(x_{t-1}\\) to \\(x_t\\) Diffusion via SDE - estimate \\(s_{\\theta}(x)\\) to approximate \\(\\nabla_x \\log p(x)\\) - numerically solve reverse SDE - SDE governs \\(x_{t-1}\\) to \\(x_t\\) transition Diffusion via CNF - estimate \\(v_t(\\theta)\\) to approximate a vector field that generates \\(p_t\\) - Solve time-dependent probability \\(p_t\\) - \\(p_t\\) governs \\(x_{t-1}\\) to \\(x_t\\) transition GAN - image generator - image discriminator - run the generator DALLE - visual encoder-decoder - autoregressive seq model - encode text by BPE - generate the text-image token sequence autoregressively - decode image tokens into image VQ - VAE I will unpack the Vector Quantized Variational AutoEncoder ( VQ - VAE ) model, loosely based on vdOVK18 . Fig. from vdOVK18 There are four components that are parametrized: the codebook embedding \\(e_{\\theta}\\) , encoder \\(E_{\\theta}\\) , and decoder \\(D_{\\theta}\\) , and the priors \\(p_\\theta\\) over the embedding space. The codebook is \\(e_{\\theta} \\in \\mathbb{R}^{K \\times D}\\) . \\(K\\) is the size of the codebook, and the \\(D\\) is the code length for each of the embedding. \\(\\theta\\) denotes the entire set of parameters, which is learned through data. Note that the codebook is learned. The encoder is a neuralnet. It could be any neural network. vdOVK18 uses a CNN , but this is a design choice that could be experimented. The exact architecture is not required by theory but will greatly impact empirical results. The encoder takes an image, \\(x \\in \\mathbb{R}^{3 \\times H \\times W}\\) as input, and outputs into the embedding space \\(\\mathbb{R}^{D}\\) . The full dimensionality of this stage depends on the neuralnet architecture. For example, we could choose a \\(32 \\times 32\\) embedding vectors to represent an image of \\(128 \\times 128\\) . This output is quantized and drop its embedding dimension \\(D\\) . Each embedding is quantized into a number \\(z \\in \\{1, ... K\\}\\) . That is, each embedding vector is no longer a \\(D\\) -vector but just a number. Lastly, the decoder is another neuralnet that takes the quantized embedding and output an image in \\(\\mathbb{R}^{3 \\times H \\times W}\\) . The prior \\(p_\\theta\\) is over the embedding space. It could be such that it is conditioned on some labels. That is, \\(p_\\theta(z | l)\\) , where \\(l\\) represents label classes. The prior allows us to sample an embedding based on a class label. Image generation is straight forward. First, we sample encodings from the priors neural network \\(p_\\theta(z|l)\\) . Second, the encodings are fed through the decoder network \\(D_{\\theta}\\) to generate an image. This methodology also applies well to music generation; see DJP+20 . The only difference is that instead of \\(x\\) representing an image, it represent an audio segment. The key question is how to train these 4 components: \\(e_{\\theta}\\) , \\(E_{\\theta}\\) , \\(D_{\\theta}\\) , and \\(p_{\\theta}\\) . This is broken down into two stages. The first stage approximate \\(e_{\\theta}, E_{\\theta}, D_{\\theta}\\) . Let's write down the loss function associated with them: \\begin{equation} \\mathscr{L}(x; \\theta) = ||x - D_\\theta(x)||_2^2 + ||sg[E_{\\theta}(x)] - e_\\theta||_2^2 + \\beta ||sg[e_\\theta] - E_\\theta(x) ||_2^2 \\end{equation} Note that \\(D_\\theta(x)\\) is an abuse of notion to denote the generated image if we take the input as the quantized encoded embedding. The first term is the reconstruction loss, the second term is a simple vector quantization loss, and the third term is the commitment loss to ensure that embedding space does not grow too large. The goal here is not to explain how to derive or improve these loss terms, we want to know how to operationalize the training by using data. With this loss defined, it is now clear that all we have to do is to feed data into all the parametrized functions (e.g \\(e_{\\theta}, E_{\\theta}, D_{\\theta}\\) ), calculate the loss, and then perform gradient descent with each batch of data. The second stage approximates priors' probability distributions of the encodings, \\(p_{\\theta}\\) . It might be tempting to model the density functions explicitly via log likelihood, cross-entropy, or other probability divergence measures. This approach is empirically useless because the dimensionality of embedding space is too large. One of the breakthrough in AI is the ability to model probabilistic model with autoregressive models, as evidenced and made hugely popular by the success of LLM . This technique applies here as well. The encodings are treated as any other high dimensional object, in this case \\(e = (e_1, e_2, ..., e_D)\\) . The model take a partial vector \\((e_1, ... e_i)\\) as input and predicts the next token \\(e_{i+1}\\) . The loss could be just a L2 loss between \\((e_1, ... e_i, e_{i+1})\\) and \\((e_1, ... e_i, \\hat{e}_{i+1})\\) . This simple setup allows us to update the neural network. The current state of the art uses neural networks that are transformer based. See BYAV13 , CMRA17 , CKS+17 , RWC+18 for more details about estimating high dimension joint probability distribution. See DJP+20 , RvdOV19 for details about more design space for vq-vae encode-decoder system. Diffusion via Score Matching One of the most popular image generating model is diffusion. We take a look at the model presented in HJA20 . \\(x\\) is in the image space. There is a diffusion process \\(x_t \\sim \\mathscr{N}(x_{t-1}, I)\\) such that, \\(x_0\\) is the original image, and \\(x_t\\) is the previous image \\(x_{t-1}\\) plus some white noise. The generating model is the reverse of this process. We model this reverse process with a transition probability density. The transition process is represented as \\begin{equation} p_{\\theta}(x_{t-1} | x_t) = \\mathscr{N}(x_{t-1}; \\mu_\\theta(x_t, t), \\sigma_\\theta (x_t, t)) \\end{equation} For simplicity, we set \\(\\sigma_\\theta\\) to be fixed and only focus on \\(\\mu_\\theta\\) . We would like to approximate \\(\\mu_\\theta\\) using a neuralnet. Once we have that approximation, the generating process is as simple as just start with a white noise \\(x_T\\) , and then sample \\(x_{t-1}\\) from \\(x_t\\) based on the transition probability \\(p_\\theta\\) . We repeat this transition for \\(T\\) steps. To approximate \\(\\mu_\\theta\\) , we rewrite and parse out a new quantity \\(\\epsilon_{\\theta}\\) , defined as \\begin{equation} \\mu_{\\theta}(x_t, t) = \\frac{1}{\\sqrt{\\alpha_t}} \\left[ x_t - \\frac{\\beta_t}{\\sqrt{1- \\bar{\\alpha_t}}} \\epsilon_{\\theta}(x_t, t) \\right] \\end{equation} A neuralnet is setup to represent \\(\\epsilon_{\\theta}\\) and is optimized by training on this loss, \\begin{equation} \\mathscr{L}(x_0, t; \\theta) = || \\epsilon - \\epsilon_{\\theta} ( \\sqrt{\\bar{\\alpha_t}} x_0 + \\sqrt{1- \\bar{\\alpha_t}} \\epsilon, t) ||^2, \\end{equation} where \\(\\epsilon \\sim \\mathscr{N}(0, I)\\) and \\(t \\sim U(1, ..., T)\\) , and \\(x_0\\) is a data sample. The loss could be calculated for each data point. The complexity of this generating model is deriving what the neuralnet supposes to represent and the loss function. But when these entities are written out, it is relatively straight forward to understand the computations both in inference and training stage. See RBL+22 for an improved version of this diffusion model. Diffusion via SDE The diffusion process could be formulated as a stochastic process. This is my personal favorite because the theory is succinct and compact. Let \\(\\{ x_t \\}_{t=0}^T\\) be the forward diffusion process modeled as an Itô integral, \\begin{equation} dx = f(x, t)dt + g(t) d \\mathbb{W}, \\end{equation} where \\(\\mathbb{W}\\) is a Wierner process. \\(f(x,t)\\) is a drift term, and \\(g(t)\\) quadratic variation. For simplicity, we set them to be time-dependent constants. The reverse process is a known math result, see And82 , \\begin{equation} dx = \\left[ f(x,t) - g(t)^2 \\nabla_x \\log p_t(x) \\right]dt + g(t)dW, \\end{equation} where \\(dt\\) is negative timestep and \\(W\\) is a backward Wierner process. We can solve this backward SDE numerically if we know the term \\(\\nabla_x \\log p_t(x)\\) . We estimate \\(\\nabla_x \\log p_t(x)\\) with a neuralnet. With that, we have a generating model because the reverse process is fully described by the backward SDE . The neuralnet that needs to be learned from data is \\(s_{\\theta}(x, t) := \\nabla_x \\log p_t(x)\\) , which SSDK+21 names the score function. It shows that this neural network could be efficiently trained by minimizing the objective \\begin{equation} \\mathscr{L}(x, t; \\theta) = \\mathbb{E}_{p_{data}(x)} \\left[ tr(\\nabla_x s_{\\theta}(x)) + \\frac{1}{2} ||s_{\\theta}(x) ||^2 \\right] \\end{equation} The expectation is estimated by the batch average of training samples. There are additional techniques to training the score network that works with perturbed sample data; see BYAV13 . SSDK+21 uses a random projection to approximate \\(tr(\\nabla_x s_{\\theta}(x))\\) . Regardless of training methods, the key is that \\(s_{\\theta}\\) is approximated by neuralnet that could be efficiently trained from data samples. Diffusion via Continuous Normalizing Flows (CNFs) The continuous normalizing flow formulation is slightly involved but a more general approach than other diffusion setups. We follow the notation in LCBH+23 . Let \\(\\{ x_t \\}_{t=0}^T\\) be the series of transformation from noise to data. The time-dependent probability path governing this transformation is \\(p_t\\) . We define a time-dependent map \\(\\phi_t\\) , which is called the flow, \\begin{eqnarray*} \\frac{d}{dt} \\phi_t(x) &=& v_t(\\phi_t(x)) \\\\ \\phi_0(x) &=& x \\end{eqnarray*} Then, \\(p_t\\) is defined as, \\begin{equation} p_t = p_0 (\\phi_t^{-1}(x)) \\det \\left[ \\frac{\\delta\\phi_t^{-1}}{\\delta x} \\right] \\end{equation} The most important object is \\(v_t\\) , which is called the generating vector of the probability path. We approximate this vector by a neuralnet, \\(v_t(\\theta)\\) . The ODE and \\(v_t(\\theta)\\) solves \\(\\phi_t\\) , which lead to \\(p_t\\) . There are some traditional numerical methods to solve ODE , or we could use a neural ODE technique; see CRBD19 . \\(p_t\\) describes the transition probability of \\(x\\) . Let's describe how to estimate \\(v_t(\\theta)\\) . Consider the flow matching objective, \\begin{equation} \\mathscr{L}(x, t; \\theta) = \\mathbb{E}_{t, p_t(x)} ||u_t(x) - v_t(x; \\theta) ||^2 \\end{equation} But we don't know \\(p_t\\) and \\(u_t\\) . Instead, we could switch to a conditional flow matching objective, \\begin{equation} \\mathscr{L}(x, t; \\theta) = \\mathbb{E}_{t, q(x_0), p_t(x|x_0)} ||v_t(x; \\theta) - u_t(x|x_0)||^2 \\end{equation} This loss leads to the same gradient with respect to \\(\\theta\\) as the flow matching objective. With this transformation, we can get a solid handle on \\(p_t(x|x_0)\\) , and indirectly the generating function \\(u_t(x|x_0)\\) . For example, we can consider a special, gaussian probability path, \\begin{equation} p_t(x|x_0) = \\mathscr{N} (x | \\mu_t(x_0), \\sigma_t(x_0)) \\end{equation} It simply means that the transition is sampled from gaussian that has time-dependent mean and variance. This special flow leads to a rather simple form for \\(u_t(x|x_0)\\) \\begin{equation} u_t(x|x_0) = \\frac{\\sigma_t^{\\prime}(x_0)}{\\sigma_t(x_0)} ( x - \\mu_t(x_0)) + \\mu_t^{\\prime} (x_0) \\end{equation} Let see how we update the parameters of the neuralnet representing \\(v_t(\\theta)\\) . Take a batch of samples, the expectation is estimated over the sample batch. \\(u_t(x|x_0)\\) is directly calculated. We get the conditional flow matching loss value, and then we can perform gradient descent on \\(\\theta\\) . The CNF formulation is a generalization of diffusion model. Even if we were to model the same generating process, we could approximate different components. SSDK+21 uses the neuralnet to represent a score function, and LCBH+23 approximates a time-dependent vector field. GAN GAN model was introduced by GPAM+14 . It uses two neural networks, a generator and a discriminator, to model a competitive game between the two neural networks. Take the example of a text-to-image GAN model. The generator neural network takes text as input and output image. The discriminator neural network takes input and image pair, and output a probability on if the image is real or fake. GAN models tend to be small in parameter size. They are are easy to use because sampling only requires running the generator neural network once to generate new samples. Training a GAN model updates the two networks simultaneously. The discriminator loss function keep tracks of how well it could distinguish the fake and the real images given a text-image pair. The generator loss function keeps track of how well it could trick the discriminator. When we feed a batch of text-image pairs to the generators, we get fake images. We can use the text, real image, and fake images to calculate the loss for both of the discriminator and the generator networks, allowing for updates of both network's parameters. This colab and a pytorch tutorial nicely illustrate the training step of the adversarial game. See RMC16 for how CNN is used for a GAN model. Autoregressive Model ( DALLE ) Autoregressive model is made popular by GPT . An autoregressive model takes a token sequence as input and outputs one more token. The initial sequence and the predicted token form a new token sequence to be fed into the model again. This process repeats itself until the predicted token is a special STOP token. Training on an autoregressive objective is often called pre-training because raw data could be fed into the model directly. The raw data could be text, image, audio, or video. These data are encoded into token space as sequences, and each token sequences could be converted into multiple subsequences and the next token as the input and expected output for training. This paradigm works extremely well for text, the so called language models. We can look at a specific example that deals with image, the dalle model described in RPG+21 . It has two major components: the visual encoder-decoder system and the prior over text-image token sequence. The first component is similar to what we discussed in details in the VQ - VAE model. For discussion simplicity, we just assume that its encoder-decoder setup follows what is described there. The key difference lies in how dalle estimates the prior. The text is encoded by the BPE encoder, see SHB16 . This encoder is calculated from the corpus and does not require training a neural network. The text token length is padded to a fixed length of 256. The image is encoded by the visual encoder into the codebook space, which has dimension of \\(K\\) . The text and visual token sequences are concatenated to be used as input in the second component, an autoregressive model over the visual token space. The generating process starts with a text token sequence. It repeatedly generates the next token until the desired image token sequence length is reached. The image token sequence is then decoded into an image by the visual decoder. The BPE encoder is calculated directly from the corpus. This algorithm is fast and efficient. The visual encoder-decoder follows similar steps as discussed VQ - VAE . This takes the form of multiple neural networks. The autoregressive neural network is trained on raw text-image pairs. The loss objective is how well the neuralnet predicts the next visual token. This is a technique to indirectly model the full probability distribution of the visual token space. It is an approach that is well demonstrated by LLM to approximate high dimension probability space. See BYAV13 , CMRA17 , CKS+17 , RWC+18 . The neural network in this components could be many orders of magnitude larger than the visual encoder system. The majority of the training resources is spent on training for an neural network to estimate a probability distribution. Discussion I have not said much about the internal architectures of the neural networks described in each example. It is a point that I want to make that the role of neural network is not required in theory. Any high dimension estimation methods could work. However, neural networks have become the only meaningful way to approximate high dimensional function in these models. As the writing of this post, these neural networks invariably use CNN and transformer components. I would expect that the internal architectures will evolve, and we might see new class of internal architectures as soon as in a few years. One of the most important aspect of model formulations is deciding on what to estimate. This decision is usually guided by two factors. The approximated entity should be easy to use in the inference stage. For example, the inference of GAN model is much faster than a diffusion or an autoregressive token model. GAN model only needs to pass through the generating neuralnet once to get the result, but a diffusion step needs to be run \\(T\\) -many passes through the probability transition step. The other aspect of formulation is the efficiency of learning from data. It is easy to spot an entity that is useful to estimate with a neural network. For the example of an image diffusion process, it is obvious that we want to estimate the time-dependent, joint distribution that governs the reverse process. In theory, we could generate sequence samples from raw images, and use them to approximate the transition directly. This is not going to lead to good empirical results. Instead, we have the somewhat convoluted diffusion models in the form of score matching, SDE , and CNF . Each of these models make additional assumptions about the reverse process to allow for clever math so that we could derive some entities that could be efficiently learned from data. The learned models need to generalize well beyond sample data. The approximating neural network is trained on some loss objective. It is easy to get a neural network to fit the data well. The effectiveness of the model is not necessarily determined by this arbitrary loss objective, but on how well it performs for the intended generation task. The amazing thing about these deep learning techniques is that these tremendously large deep neural networks are able to acquire the ability to generalize to tasks that are not directly specified in the training data. Footnotes It is worth noting that some generative models does not contain any interpretable intermediate steps. It could be just one giant blackbox neural network model that transforms the text into an image. Human researchers might understand how individual computation is performed, but we might not able to make sense of any intermediate representations. ↩","tags":"misc, , ai","url":"/2023-10-17-gen-models","loc":"/2023-10-17-gen-models"},{"title":"First Notes on Wing Foiling","text":"Wing foil is one of the easiest ways to get into wind sports. It is relatively less gear to wing foil compared to windsurfing or kiteboarding. The ability to harness wind power to cruise around a very large playing field is freedom. There are a lot of easily accessible launch spots in the East Bay: Alameda beach, toll plaza, Emeryville Marina, Berkeley Marina, Albany Bulb, Marina Bay, etc. One could get into a dynamic water sport without having to cross a bridge to get to the coast in the San Francisco side. Wing foiling opens up opportunities to wave riding 1 and deflate sup foiling downwind. I have seen people winging up to Treasure Island and then SUP foiling downwind back to Berkeley Marina. What I learned Here is a list of lessons in the order that I encountered them. I will come back to update and add more in the future. Wing skateboarding. I went to a large parking lot to wing skateboarding. This taught me how to hold the wing, transfer wind power to my legs, and go upwind in both directions. Wing supping. I installed a large fin on my regular, large sup board, which allows the sup board tracks upwind easily. I wing-supping a few sessions until I was fully comfortable going upwind on water. I had to do the walk-of-shames during the first session. In most of my second session, I was able to maintain upwind ground. I was able to get in and out at the same spot without needing to walk. By the third session, I was able to cruise around and get to wherever I wanted on that stretch of beach. Wing foiling but really just wing dragging. I swapped out my sup board with a foil board. It felt different. The balancing was different. I wasn't able to stay upwind as much. The balance was different. It took me about 1 full session to be able to gain upwind ground in my strong direction (riding goofy). It took me at least 2 more sessions before I was able to comfortably gain upwind ground on my switched stance (riding regular). At this point, I had not attempted to learn toe-side yet. My next challenge was learning how to get up on foil. There was a lot going on. The most important thing was speed, and learning how to handle board speed while holding the wing was not intuitive. I spent time just focusing on getting more comfortable riding with speed without allowing my foil to fly. Once I was able to maintain speed with control, I could engage and disengage the foil. I learned foiling comfortably without sketchy moments that way. Foiling is different from a riding surfboard. Once the board rises above the water, the board is very tippy left and right because the board's foam is no longer counter-balancing on the water. It is a lot like riding a bike. The side-to-side roll is sensitive. My feet and body have to constantly balance to keep the board tipping over on the side. The moment the board left the water, it is a completely new boarding experience. I felt weightless, speed, and out of control. I intuitively learned how to feel the wind. It was a strange feeling. Once I get enough water time, maybe 5-7 sessions in, I could feel the wind direction, wind strength, and sometimes even kind of feel the gust pockets. Obviously, I cannot see the wind. It is possible that I could hear the wind. It is definitely more of a feel. I could tell on an instant that it's 8-10 knots or 12-15 or whatever just having the wind touching my face. The weirdest is the ability to kind of predict that a gust is coming through, or being able to tell that a gust might last long enough for me to gain speed. I couldn't describe any explicit physical signs. It is the overall feel while holding the wing with feet on the board. I was reminded again and again that it is not really about \"knowing\" how to do a certain thing. It was all about muscle memory. My body had to learn to do it, not my mind. The best example was that I knew how to do one thing goofy, but just couldn't bring myself to come close to doing the mirror direction. I knew exactly what to coach myself and understood what I was doing wrong. But my body kept repeating the same mistakes. It took time for the muscle memory to kick in. It was both frustrating when I was failing and the best feeling when my body intuitively knew what to do. Learning in low wind. Learning to foil in low wind is difficult to impossilbe. I made this mistake many times. I was frustrated during those sessions. However, my wing handling skill was a lot more advanced compared to otherwise. In a roundabout way, I am much better prepared for windy conditions, say 25-35 knot conditions. I am comfortable handling the wing to sail without foiling and maintaining upwind ground. Even in strong wind, I could just stand there and pull in only as much or as little power as I need to go where I want to go. As long as I am able to do that, I feel safe and comfortable in any condition. This peace of mind allows me to go and have fun regardless of wind conditions or shoreline hazards. It is best to learn in medium wind conditions. 15-20 knots are probably best. Anything less than 15 knots, one has to pump to get up on foil. Pumping works best with both wing and board combinations. Once the wind approaches 25, the wing could feel over-powering. I would drift downwind quickly unless I am up sailing. It could be overwhelming to someone who is not experienced. I started learning on Alameda beach because the unidirectional wind is always blowing onshore. I walked along the beach to gain upwind ground when I was not able to return to the same spot sailing. Once I was comfortable maintaining upwind ground, I started going Berkeley Marina. It is a windier spot. It has nice facilities for me to set up and clean up before and after wing sessions. Jibing and riding toe side. Riding toe-side is completely different. I once thought that once I learned to ride heel-side both directions well, I should be able to ride toe-side easily on my goofy stance. I was wrong. Riding toe-side requires completely different balancing skills even though I know how to hold the wing and ride in the direction that I want to go. Toe-side riding requires yet another set of muscle memory. There is no shortcut. Footnotes The best being fort point . ↩","tags":"misc, , water, foiling","url":"/2023-06-25-wing-foiling-tips","loc":"/2023-06-25-wing-foiling-tips"},{"title":"Domain Specific AI Assistants","text":"ChatGPT has become one of the most popular AI assistants. It is a fixed, general purpose model. That is by design because it learns from mostly texts in the public domain. This applies to open source LLMs as well, e.g. T5 , LLaMA , and falcon . However, there are many reasons to build AI assistants that specialize in limited knowledge domains. For example, I might want to talk an AI assistant specific to home purchasing, an assistant is limited by a company's private corpus of documents, or a personal assistant that has access to my private emails, texts, and calendar. In this post, I will discuss the techniques relevant to building a domain-specific AI assistant. An LLM accesses private information in one of two ways. The information could be baked into model parameters, and then the chatbot retrieves relevant information through prompting. This requires the LLM to be trained on private data after it acquires its general comprehension capabilities. The other way is to provide the domain specific knowledge as input context. The information retrieval step is performed outside of the language model. It could be calls to external APIs or searches of a document store. Knowledge Infusion Through Model Training LLMs could answer a wide range of common questions directly. Their breadth of knowledge could be comparable to a search engine. LLMs memorize information by applying some pretraining objectives to a large corpus of data. The data sources are usually a mixture of internet text, books, and other various public text corpus. It is important to note that an LLM only loosely absorbs information from the training corpus. There is no guarantee that any one piece of data will be 100% burned into model parameters. Even the largest neural model is not likely to incorporate information in any predictable or uniform way. There are estimates about those information retention rates. For example, CIJ+23 quantifies the some lower bound to be 1% . It is impossible to exactly quantify this retention rate. The memorization mechanism is hard to know for sure. We only know that a large neural model could learn from simple pretraining objectives. PSC21 discusses the limit of knowledge acquisition through masked language pretraining objectives. We are still early in our understanding of how LLMs work and how to properly perform precise memory infusion, extraction, or modification. There are many unsupervised training techniques to infuse knowledge and capabilities to LLMs. The most common is the various forms of masking spans. For example, next token prediction, next sentence, and random masked spans are the most popular and common pretraining objectives. Pretraining objectives: different span corruption strategies ( TDT+23 ) If the private domain contains structured knowledge, e.g. knowledge triplets, one could devise a masking strategy that focuses on key concepts and relationships ( SWF+21 and MDAJ22 ). Pretraining objectives: structured knowledge infusion and universal knowledge-text prediction Other objectives could be predicting sentence ordering. The model is asked to recover the correct ordering from a randomly permuted sentence order. Another objective could be sentence distance. A model takes two sentences as input and predict the category in the set of adjacent, nonadjacent but same document, and not in the same document. Other than using unsupervised training objectives, it is also possible to update the model parameters through the use of synthetic data with a task-specific fine-tuning objective. Some LLMs, either the one undertraining or an external LLM , could be used to generate input and target pairs from raw texts. The input and target pairs could be in the form of question and answers, question and supporting passages, queries and topic classifications, etc. The model is then trained on these labeled data. Fine-tuning on synthetic labeled data ( SFKS+23 ) LLM Information Extraction Techniques There are different techniques to extract information from model parameters. The most obvious approach is through prompt engineering. The most well-known prompt techniques are few-shot learning and chain-of-throught prompting. Prompting: few-shot ( BMR+20 ) and chain of thought ( WWS+23b ) There are additional variations. For example, the few-shot could be combined with synthetic example generation. The few examples could be modified to be a recitation-augmented prompt, as proposed by SWT+23 . Prompting: few-shot with recitations Other than better prompts, there are other ways to improve information extraction from model parameters. WWS+23a proposes sampling multiple outputs and choosing the most consistent answer from the set. One could also ask the LLM to produce multiple relevant recitations about the queries and use those recitations as context for the LLM to make a final output. Sampling multiple outputs WWS+23a The sampling and prompting techniques could be combined in any permuted order. For example, one could prompt the LLM to sample multiple recitations and keywords that describe the input query. The recitation passages could be deduplicated by or combined keywords before being used as examples in the final prompt fed to the LLM . In many ways, once an LLM completes learning what it could through its training stages. We can use prompt engineering techniques or modify the last layers to direct the LLM to squeeze information from its parameters and explore different outputs. Retrieval-Augmentation The easiest way to create a domain specific AI assistant is using a retrieval step to add a context to the prompt. This approach is commonly described as retriever-reader in the open book question-answering literature. This framework works well for AI assistants as well. The foundational model provides general reasoning and comprehension capabilities, and the information required to answer user queries is provided in the context. The key advantage of this approach is the retrieval step is very predictable and explicit. It is a much more efficient and reliable information retrieval procedure than prompting an LLM to find information stored in its parameters. However, the context window could become a limiting factor. The responses is likely to be superficial and read like a summary of the provided context. The LLM could not draw patterns, nor could it synthesize data from the entire corpus relevant to the query. The enterprise knowledge chatbot created by Glean follows this pattern. The retrieval step is text search, which could use semantic embeddings powered by LLMs or vanilla lexical search. See my previous post on document search . There are additional ways to modify this setup. For example, ZYD+22 proposes appending synthetic data to be indexed with the raw document to improve retrieval effectiveness. External Tools The retrieval step is an example of using external tools to complement the knowledge implicitly stored by the LLM . OpenAI's plugin and function calling API is another example of piecing together external tools to be part of an AI assistant. The basic framework is simple. Before the LLM produces the response, it determines whether it should perform external API queries. The set of available APIs could vary from application to application. The LLM can turn the original query into inputs to external APIs, and the API 's responses are converted into context as part of the final prompt. SDYD+23 and PGH+23 test similar schemes to validate zero-shot performance across a variety of NLP tasks. Such a component could be built and patched to any open-source LLMs. There are two key NLP tasks that this component requires. One is a classification task on whether the LLM needs to use the external API given a query. The other task is transforming the query into compatible API inputs. This framework extends the AI assistant's ability to both retrieve relevant data and perform actions on behalf of the AI user. Final Thoughts AI assistants had not been very useful until the recent progress in LLMs. However, ChatGPT is just the beginning. I expect that more and more applications will enable meaningful and easy-to-use assistant user interface. These assistants won't be general-purpose the same way that web search or ChatGPT are. These assistants will be limited in scope and domain-specific to the application. Chatbot style interaction is a natural extension to an search box, which has been a standard feature for the past decade. The next generation of applications will use chatbot as the interface for both information retrieval as well as performing actions. Footnotes","tags":"misc, , ai, LLM","url":"/2023-06-04-domain-specific-ai-assistant","loc":"/2023-06-04-domain-specific-ai-assistant"},{"title":"Open Source LLMs","text":"Large language models ( LLM ) are becoming an increasingly requisite component for modern applications. The text generation capability has crossed a threshold to perform many well-defined NLP tasks as well as human workers could. Many of the recent applications in the past year, in 2022-2023, leveraged close-source, privately hosted LLMs. Many of the best, largest state-of-the-art models are close-source and controlled by private companies such as OpenAI, Cohere, Adept, Anthropic, and Google. There is also rapid progress in the open-source community. At it stands today, open-source models are not likely to be able to match the capabilities of private LLMs due to training data, compute costs, and engineering resources. However, the open-source models could be sufficiently powerful for most applications. Open-source models have their advantages as well. For example, developers could choose a specific model size to match the application requirements to reduce compute waste. The open models could be further modified and trained for specific domains or NLP tasks. In this post, I am sharing my thoughts on how to get started on choosing an open-source LLM . Ingredients of LLM I am going to describe the key factors that differentiate LLMs. The best way to gain a decent understanding of LLM is to read a few of the classic papers on the topic. I would recommend the original transformer paper( VSP+17 ), the BERT (Bert DCLT19 ), and UL2 TDT+23 , and the instruct-gpt paper( OWJ+22 ). The first thing to notice about an LLM is its model architecture. All the recently published, relevant LLMs are transformers-based. That is, the key building block is transformer. Here are two excellent tutorials that describe this building block: 1 and 2 . A transformer could be loosely understood as building a cross-attention, n -by- n matrix with the size n equal to the input length. The matrix tells the model how to represent each token based on the entire context. The full model has many layers, with each layer having many different multiple heads of these transformer units. One could perceive a transformer unit as a unique logical way to evaluate the input. For example, one unit is to evaluate the input's language structure, the other unit is to evaluate the input's historical context, etc. There are two major variants of transformer-based LLM that are popular: decoder-only and encoder-decoder. Encoder-only could be subsumed by the encoder-decoder model because the decoder part could be discarded for specific downstream tasks. However, from the perspective of LLM users, we don't have to worry too much about the pros and cons of the different variants. The only key distinction is that the decoder-only model concatenates inputs and targets to process them together. This would make decoder-only models less appropriate for applications that require text embeddings. An LLM 's size is measured by its number of learned parameters. Many models are trained with multiple sizes for ablative experiments. The largest model is the most powerful. However, they could be expensive to deploy. For example, a medium size INSTRUCTOR model might be sufficient for a semantic search application. The biggest is not always the best. We could choose the model sizes based on trade-off of cost, computation, and model performance, Another LLM feature is the size of its context window. One of key limitations of transformers is quadratic memory and computation requirement with respect to input length. This limits the input length and the context window. There are variants of transformers, such as long-formers, ETC , and long-t5 that scale linearly to input length. However, there aren't any powerful, truly large LMs that are fully trained in those architectures. If long context window is a requirement for your applications, you might have to adapt your model and train from scratch. However, pretraining from randomly initialed weights is very expensive and require a fairly substantial amount of engineering know-how. Another key ingredient is the pretraining objective. LLM gets its reasoning capabilities and knowledge through processing massive amounts of text. The key idea is to hide some part of the known text for each sample, and then ask the model to predict missing span. The objective is to score the prediction, and then use the score to calculate parameter gradients. There are different strategies to generate masked texts. The most common objectives are left-to-right span corruption (e.g. next token prediction), prefix + span corruption, random span corruption, or some combination of these techniques. For practitioners, we care more about the capabilities of the trained models, but less so about the exact training objective. However, if we need to adapt a pretrained model to another domain, we would have to program the objectives to allow the model to process additional corpus. Another ingredient is the training data. Pretraining data is unlabelled text. Fine-tuning data is labelled dataset, and is much smaller than the pretraining dataset. It is important to understand what data have been used for the pretrained checkpoints. This allows us to understand what domain the model could perform well, what knowledge the parameters could contain, and how to further improve the model. Another ingredient is input representation. The input representation is learned from the text corpus. We need to consider if the application domain has similar vocabularies. It is usually not a problem for LLMs that are trained on large, sufficiently diverse corpus. Lastly, we have to consider if the models are trained by reinforcement learning through some reward models. The technique of furthering training fine-tuned models to have more human-preferred outputs is known as reinforcement learning from human feedback ( RLHF ). It is different from fine-tuning with human-labeled data. RLHF takes labeled data to train a completely independent reward model. The reward model is used to evaluate outputs generated by the fine-tuned model. See SOW+22 for more details. This step has been shown to limit hallucination and optimize outputs for human preferences. However, this step also restricts the model's variance and makes it more likely to generate similar, mundane outputs. RLHF -trained models are hard to further fine-tuned or modify for specific tasks. Start with a Pretrained LLM The practical way of using an open-source LLM is to choose a pretrained checkpoint from well-known models. There are many overviews and surveys about LLMs. For example, YJT+23 provides a good overview of the history and the latest LLM models. As of the writing of this blog, I would consider one of these as a starting point: t5 ( RSR+20 ), long-t5 ( GAU+22 ), flan-t5 , flan-ul2 , pythia ( BSA+23 ), dolly , instructor ( SSK+23 ), LLaMA ( TLI+23 ), falcon . I would also consult various well-known benchmark leaderboards on LLM and specific NLP tasks. For example, open LLM leaderboard , MTEB , HELM , and Chatbot Arena There is no one-size-fits-all step-by-step guide on how to choose a foundation model. I would start by understanding my application's requirements. I would consider the application's expected input and output. For example, I would ask some of these questions: what are the typical lengths of the application's queries, do queries require supplementary contexts, are the outputs long form answers, etc. These questions would guide me to find a foundation model that best matches the characteristics of the application. I would also consider the desired NLP tasks and the model's strengths and weaknesses. For example, depending on if it is a text search application or an AI assistant, I would choose an encoder or a decoder-only LLM . Customizing Models Once I choose pretrained checkpoint, I would consider how to further modify the model to fit my application. I could modify the last layers to target a classification task. I could modify the token search algorithm to allow for an output distribution as opposed to always selecting the output with the highest probability. I could discard the decoder component and only use the encoder to generate text embedding. If the application domain is very different from the text corpus used for pretraining, it could be appropriate to train the model from scratch. This might be viable for small models, but for models that cross the billion parameters threshold, both the GPU compute and engineering resources would be prohibitively expensive for most small teams. An LLM could be further customized by fine-tuning with high-quality datasets, e.g. dolly-15k or flan , that were not used to train the model. I could fine-tune the model with private data. I could set up a reinforcement learning loop to instruct the model to have more human-preferred outputs. 1 I could further train the model on its pretraining objective on a private corpus for domain adaptation. The model could also acquire additional knowledge from processing large amounts of domain-specific text, allowing the model to answer queries with information that might not be explicitly included in the context window. Model Deployment Training LLMs is very expensive, but even model inference is not cheap. Every prediction requires a forward pass. That passthrough needs to touch every parameter. For a 100 billion parameter model, that is 100 billion floating point operations at a minimum. A sentence response is as many passes as the length of the sentence. For latency reasons, the parameters need to be pre-loaded into memory, which is likely to be in the 10s GB range. While operations could be performed by the CPU , the matrix computation should ideally be performed in the GPU . Even for a moderate-size LLM , the inference model needs to be supported by a GPU with 10s GB of memory. 2 Footnotes This could be brittle and should only be taken when there are sufficient resources to collect human feedback, set up the experiment, and evaluate the model. ↩ I am not going to discuss model sharding at this post. It might be a topic for future post. ↩","tags":"misc, , ai, LLM","url":"/2023-04-27-open-source-llm","loc":"/2023-04-27-open-source-llm"},{"title":"Pretrained LLMs and Text Search","text":"Pretrained transformer-based large language models ( LLM ) are upending our perception about what AI is capable in language comprehension. While large, open-domain search engines have been able to process natural language queries, the search box feature in all other applications, such as site search or search in enterprise applications, have been limited to keyword based. The emergence of LLM is changing that. Both consumers and enterprise application users will soon expect all search boxes to understand natural languages. In this post, I will outline my thoughts on what is next for text search. Text Search Paradigms I am not writing a comprehensive literature review on information retrieval and LLM . One could refer to LNY21 for an overview. I only intend to give a short review of the techniques that are relevant to implementing a search system as technology stands today. Lexical Search I use lexical search loosely to refer to a variety of text processing techniques and term-based relevance scoring algorithms. The best known models are BM25 and tf-idf . These models encode documents as sparse vectors, storing them as inverted indices. The search algorithm converts the query into a vector to calculate a relevance score. Learned Representation Instead of relying on lexical structures, learning models are used to discover a latent feature set to represent the documents and queries. The learning models could be linear models, tree-based models, LSTM (long short-term memory), or transformer-based LLM . The learned representations could be sparse vectors. The search algorithm treats the sparse vectors similar to how lexical search makes use of inverted indices. The model architecture has a hidden layer that corresponds to the latent features representting the input text. The rest of the model could be any learning models. The model is trained to have a sparsity as well as relevance objectives. The trained model could be used to encode documents and queries into a vector space. See ZDC+18 for an example. The learned representations could be dense vectors. While sparse vectors could have dimensions in the millions, dense vectors usually have dimensions in the 100s or 1000s. The approach of dense vector representation existed well before LLM . Vector encoding generated by word2vec ( MCCD13 ) and GloVe ( PSM14 ) were examples of dense embedding. These models encode an input text into a dense vector with length in the 100s. Text search algorithm applies a distance similarity and nearest neighbor search on a vector space. Note that dense vector similarity algorithms could be much more efficient than sparse vector search. However, sparse vector algorithms such as BM25 are usually sufficiently efficient that it is not a major concern. See this blog post for an example of approximate nearing neighbor search for dense vectors. LLM -based representations work exactly the same. The only difference is that LLM is used to encode the documents and queries. LLM dramatically improves search results if the LLM is trained on a corpus similar to the later queries. See KOM+20 and XXL+20 for detailed descriptions on how to build these models from BERT ( DCLT19 ). LLM -based embeddings, e.g. OpenAI embedding , have lengths in the 1000s. Cross Encoding Ranking Model Both sparse and dense vector models are representation-based. They separate the computations between indexing and query. Cross-encoding requires both the query and the document to be provided as input to calculate a relevance score. They are also called interaction-based models. It is possible to use LSTM or ConvNet to encode the query-document pair. The biggest gain has been using pre-trained LLM as the starting point for model training. The query and document are concatenated as the input. An LLM encodes the input and is trained as a binary classification model. Once trained on a labelled dataest, the model could be used to give a relevance score to any query-document pair. See NC20 and NYCL19 for examples. Transformer-based LLM is O(n^2) with respect to the input sequence length. Cross-encoding has to process every query-document pair as input. That is hugely inefficient. One way to get around this is late interaction (see KZ20 ). This technique uses two separate LLMs for query and document, thus mitigating the n^2 issue with the concatenated input. It also allows for an opportunity to cache the learned embeddings. The interaction could be calculated at query time. Current State of Search Performance is a tricky topic. The LLM hype makes it seem as if LLM techniques vastly outperform traditional methods such as BM25 . The problem is that all evaluation methods are imperfect. Each model could do well in some evaluation methods but perform purely in others. A detailed discussion of performance requires the introduction of benchmarking datasets and methodologies. I will not go into those details in this post. See TRR+21 for an example. This video offers a good introduction. I would offer the following key takeaways: BM25 is a robust baseline. It generalizes well to out-of-domain data. LLM -powered dense vector model has superior in-domain performance. However, it does not generalize well. Cross-encoding ranking models are computationally expensive, but they perform the best. The tried and true lexical BM25 is the workhorse of the majority of search applications. Aside from massively well-funded solutions like Google and Bing search, the majority of applications implement lexical search. Lexical search is easy to understand, has a robust baseline, and generalizes well to out of domain datasets. It is already supported by mature technologies such as Solr and Elasticsearch. There is a large base of developers who have the expertise to build sophisticated search systems. There are also well-established hosted solutions such as Algolia and Elastic. LLM -powered embedding is getting all the hype. There are a few reasons for that. First, it enables semantic search, which is something that users intuitively understand. Everyone wants a search capability that could retrieve relevant documents that do not necessarily contain the exact keywords. LLM dense vector provides that feature easily. Second, there is a perceived performance boost. While this is not always the case, the hype is already here and everyone wants to at least says that they support semantic search. Third, vector databases are becoming a hot trend. There are new vector databases such as Weaviate and Pinecone. Traditional databases are starting to support vector similarity, e.g. Solr, Elasticsearch, Redis, and Postgres. Fourth, AI companies provide hosted LLMs via https APIs to convert documents to dense vectors. Examples are OpenAI and Cohere . They allow developers to apply dense embedding techniques without needing to even understand LLM . Cross encoding models are not popular because they are not only computationally expensive, but require custom implementations. Unlike lexical search and dense vector methods, there are no established services or open-sourced libraries to support cross encoding algorithms. Only the best funded enterprise applications have the resources to integrate those techniques in their search features. Applications would definitely want access to state-of-the-art search capabilities. The development cost is too high. Beyond Lexical It is inevitable that search applications will move beyond lexical search due to the emergence of LLM -based techniques. Pretrained LLMs have demonstrated enormous capabilities in understanding languages. Search techniques will evolve to harness its power soon. What People Want Most application developers blindly apply the dense vector similarity technique as if it universally delivers state-of-the-art results. Embeddings only work well if the application domain is similar to the pretraining and training training data passed to the LLM . I would want a search system that is more than lexical but also more than just embedding similarities. The search system must allow for domain-specific fine-tuning and pre-training of the LLM . The system has a two-stage ranking pipeline. In stage one, it uses a combination of lexical and dense vector representations to efficiently retrieve candidate documents. In stage two, it uses a cross encoding model and other heuristics to rerank candidate documents. Implementation Challenges A system such as described above is not easy to build. A lexical search system could be as simple as entirely relying on an Elasticsearch deployment. There is no out-of-the-box solution for implementing the hybrid system. Below I list out some of the key challenges. The challenges are not insurmountable, nor do they require theoretical breakthroughs. They merely require engineering resources. Using hosted LLM API works well for small datasets. GPT4 costs $0.12 per 1000 tokens 1 . Assuming an average of 4 bytes per token, 1MB of data would take $30 dollars to process. That is impossibly expensive! Most search applications will not be built on top of costly LLM APIs. Costs aside, indexing tasks are usually the key bottleneck of a search system. Relying on third-party LLMs as a key step for the indexing stage would require piping the entirety of the raw data outside of the system. It raises issues with regard to performance and data privacy. The two-stage pipeline has to be custom-built. Reranking algorithms must be custom software. Lexical search and dense vector nearest neighbor algorithms and toolings are widespread, but cross encoding models do not have plug-and-play implementations. The search system also needs to link together different ranking stages. There needs to be a robust data pipeline for indexing, storage, and retrieval. For lexical search, tools such as Elasticsearch handles all of these tasks as a single piece of software. Developers only need to interact with its APIs to index and search. A hybrid search system needs to integrate multiple pieces of database technologies to store indices and documents, and implement the search algorithms with respect to the databases of choice. Final Thoughts LLM marks a watershed moment in computer programs' ability to understand natural language. For the majority of applications, search has been stuck in keyword matching. That is going to change in the near future. Users will expect search boxes to understand natural language. Product thinkers will want to design search and Q&A boxes that match those expectations. However, semantic search is harder than just relying on OpenAI APIs and a vector database. Application developers are still waiting for toolings and hosted solutions to build high-quality, next-generation search experiences. Footnotes As of April 2023. ↩","tags":"misc, , ai, LLM, search","url":"/2023-04-04-document-search","loc":"/2023-04-04-document-search"},{"title":"Scaling NFT Metadata","text":"The NFT Metadata Scalability Problem In my previous NFT platform post , I give some examples on why on-chain metadata is required for future NFT behaviors. There are many new token standards that are under considerations that required additional metadata, such as Subscription Token Standard, Entangled Tokens, Micropayments Standard, Re-Fungible, etc. Gamefi will most likely push the boundary of on-chain metadata. NFT could represent metaverse player characters and items. Their behaviors could be governed by on-chain NFT state transitions. Increasing complex on-chain behaviors require increasing high volume of metadata. A standard implementation of ERC721 contains the following data on-chain: owner address, name, and symbol. The state transition function only depends on the variable owner address . A common extension includes an off-chain metadata specification, where each token id also contains a tokenUrl . The url is a link to some off-chain bytes. These off-chain bytes are usually required to be a json file that contains key-value pairs such as description , image_url , etc. Those key-value pairs are metadata. NFT state transitions operates on on-chain metadata. For example, we want to program an in-game NFT representing a property in a game such that it is only transferable if it has a house built on it. This simple verification check could not be programmed against the metadata has_house: bool . It is possible to get around it. For example, if there is a on-chain merkle hash representing the off-chain metadata, the state transition could take a merkle proof as an input argument. This is complicated. It introduce the issue data availability for off-chain metadata. Any metadata-update would still require an on-chain update of the merkle hash. With on-chain metadata, programming a state transition is straight forward. On-chain metadata has not been popular because of cost and existing behaviors. Storing data on-chain is expensive on smart contract platforms. The off-chain json metadata standard is a work-around solution. It introduces a user experience problem and third party service dependency. There is only one popular form of NFT usage pattern today, and that is ERC721 . It uses limited set of metadata, e.g. owners , approved_address , and approved_operator . Future NFTs will be an ever increasing variety of mini-programs that require metadata that goes well beyond what we are seeing today. The usage of on-chain metadata will grow in orders of magnitude as NFTs usage grow both in size and complexity. A blockchain has limited space for on-chain metadata. Every blockchain will at least be limited by two factors: blockspace and state size. Blockspace corresponds to how much data the blockchain could fit through its various components, e.g. gossip network, mempool, execution, and consensus algorithm to be recorded in blocks for perpetuity. Blocks contain transactions, and transactions change and append to the state. If there are a steady stream of append only metadata transactions, the state could grow indefinitely. Both historic blocks and state take up local storage. \"State size limit is the limit of how much metadata could be accumulated overtime\" Some network participants, usually the validators, have to maintain the active state of the blockchain to process transactions. They need to read and write to the state to process transactions. This sets a limit on the state size. The state size is limited by the amount of storage available to the validator node. Even if a chain place a high end hardware requirements to its validators, the single machine state size is about a few hundred gigabytes to a few terabytes, depending on implementation details. On-chain metadata accumulates into the state. A blockchain could have unlimited throughput in all other components, but its accumulated state cannot go beyond its single machine limit. There are two solutions to expand the state size beyond a single machine. One is state expiry and another is state sharding. The first strategy prunes the old data from the state. The strategy has to allows expired data to be recovered. One way to do that is that anyone could active the old state by providing the data and a validity proof. This approach has few drawbacks. It introduces a dependency to a third party service to hold the expired data, it does not scale executions, it does not scale throughput, and it is not user friendly. The second strategy is sharding. When the state grows beyond what could be handled in a single machine, the system use more machines. The key challenge is how to shard the state so that each shard could process transactions in parallel. Cost of Onchain Metadata There are fundamental reasons why on-chain metadata is always going to be somewhat expensive. The ecosystem has many convoluted discussions on high transaction fee. Every chain claims that they could lower fees to negligible level. Each virtual machine has different rules on how to charge fees. Regardless of those internal accounting mechanisms, the hard truth is that there are fundamental costs to on-chain data storage. There are two major categories: resource cost and market premium. Market premium is due to supply and demand. When the throughput of an open blockchain's virtual machine has abundance of free capacity, the fee would match the costs that network operators contributed to running and securing the network. When there is high demand for the limited capacity, there is going to be a premium surcharge. Network participants commit resource to be part of the network. They are rational actors who are expected to at least recuperate their investment. Different blockchain mechanisms require varying levels of capital commitments. Broadly speaking, they are server cost and capital cost. Server cost includes the cost of hardware, network, electricity, physical property, and maintenance. Capital cost is not always needed, but it is an essential part of proof of stake systems. Server Cost Capital Cost Proof of Work high n/a Proof of Authority low n/a Proof of Stake low high Rollup Chain (Proof of Stake DA ) low medium Proof of work chains require many expensive mining servers to maintain the network. Hardware and electricity cost are high. That cost has to pass on to the chain users. There are many estimates that put the cost of 51% attack of Bitcoin or Ethereum at the $10s billion range. We assume that amortized cost is $1 billion annually, the on-chain state accumulation limit is 100GB , and each data point is 100 byte. A per data point cost would be $1.00. This level of fee would completely eliminates any meaningful NFT use cases that we spoke of earlier. Note this ball park estimate ignores many of intricate details of a blockchain's fee structure. However, it is not far off from real world data. For example, 128 byte in Ethereum with ETH price at $1000 would be $2.4. Proof of stake chains have a high requirement on capital. The server cost of a PoS chain depends on how many nodes need to maintain the state to process and validate the transactions. This is not going to be significant. Even if there are 10,000 validator nodes with each costing $2000 per year, the total cost is only $20 million. The capital cost is mucher higher. For example, Ethereum has a $20 billion of ETH staked for the Beacon Chain. Other prominent chains are also in the range of billions or above, e.g. Solana at $15 billion, Near at $2 billion, and Polkadot at $5 billion. Assume that the staked capital expects a 5% annual return. The annual cost is $1 billion for Ethereum. This cost is similar to proof of work. It might be more environmentally friendly to switch from proof of work to proof of stake. But the ultimate cost that the chain users have to bear might be similar because the staked capital needs to be compensated. In the short run, they could be rewarded by network value inflation and growth. In the long run, network users has to pay for cost of securing capital. Proof of authority chains could have a low fundamental cost. Let say the chain only has 7 validators. Each server cost $10,000 annually. The total cost is only $70,000. It does not have any additional capital cost. The drawback is centralization risk. The nodes operators are fixed. These node operators could be hacked 1 . Rollup chains has a very low cost due to validity, and the cost of data availability ( DA ) varies depending on its trust model. The fundamental cost of a rollup chain is mostly on data availability. A rollup chain has a few key costs: server cost of sequencers, state transition validity, and data availability. The server cost of sequencers is negligible because it operates with a 1/n security assumption. The network only needs a single honest sequencer, or it could be zero if user is allowed to submit transactions to the main chain. The cost of state transition validity is also negligible because it is a fixed cost that could be amortized over 1000s or 10,000s of transactions. The key cost that scales linearly with the amount of on-chain data is data availability. The cost of data availability depends on how that data is secured. For example, if it is proof of authority, it is cheap. Solutions that are termed data availability committee ( DAC ) is just another name that the data availability layer is secured by proof of authority. It is harder to make ball park estimate on the fundamental cost of how dedicated DA chains translate into on-chain data cost. For example, say a rollup chain wants to accumulate 100 GB of on-chain data. Let's assume that it corresponds to 1 TB of transaction data that have to be post to the data availability layer. It is important to note that this 1 TB could be split up and erasure coded, where each data chunk is assigned only to less than 10 storing nodes. They do not need to be replicated hundreds of times to guaranteed persistence even in a completely open, decentralized network. Depending on the desired level of security, this could be secured by $1 billion or $10 million. That is the deciding factor on how costly this component is. It is important to note that for a fixed stake, say $1 billion, the more data it is pledged against, the less secure the individual data chucks. The reasoning is simple. The stake has to spread to more data nodes, the cost of getting slashed for reach node is smaller. It lowers the cost for an attacker to bribe the necessary nodes to censor or delete a particular data chuck. Low Cost, Scalable Onchain Metadata Future NFT use cases will demand a low cost, horizontal scalable on-chain metadata and state transitions operating on those metadata. The use cases will be especially prevalent in metaverse type applications. Deploying NFTs as smart contract on a sharded blockchain is suboptimal. First, a general purpose sharded blockchain usually do not allow smart contracts to control shard location. Shard control is useful because it could take advantage of the simplicity of the NFT data model to shard state and corresponding state transitions. For example, it could have split the NFT state into primary and secondary shards. The primary shard involves state transitions that need to be synchronized with to the NFTs. the second shard only involves in self-contained state transitions. Second, existing general purpose, proof of stake chain is expensive. Those costs are fundamental to how cost of securing the network, and hence they will not go down in the future. Use cases of programming on-chain metadata should be cheaper than transaction that involves value exchanges. A general purpose chain is not likely to offer those low cost transactions because it would require different shards to have different cryptoeconomic security levels. Third, working with smart contract does not lead to a great developer experience. Game developers who want to integrate NFTs should not need to learn the intricacies of writing and deploy smart contract. Smart contracts are not yet mature, and bugs to lead to direct loss of funds. The developer experience of programming and using NFTs should be through APIs and SDK . A proof of stake, sharded NFT blockchain is a straight forward way to remedy the challenges mentioned above. The NFT data model is much easier to shard than a general purpose virtual machine. The sharding design could accomodate both NFT specific or cross-shard transitions. The network could put different cryptoeconomic requirements on different shard types to fine tune the tradeoff between cost and security. It should be noted that a proof of stake, sharded NFT chain could be implemented as a sharded rollup chain. The core of the two chains are the same. They could have the same sharding model, same set of transaction capabilities, and the exactly the same APIs and SDK . The key difference is on how to they maintain data availability and state validity. In the case of proof of stake chain, each shard is assigned a set of staked validators that guarantee both data availability and state validity. In the case of a rollup chain, state validity is through validity proof, and data availability must be handled by a separate network. A sharded rollup chain is a valid approach to deliver low cost, horizontal scalable on-chain metadata. It should be noted that data availability solutions and validity proof are not as mature as the proof of stake technology. Proof of stake technology is still a complex technology, but it is fairly well known and is already in wide spread production use. There are already many mainnet proof of stake blockchains that built on cosmos-sdk and tendermint. On the other hand, dedicated data availability chains are still a work in progress [celetia, ethereum data shard]. For validity proofs, an application specific rollup chain would have to write its own validity proof system. A generic zk proof system such as Starkex only support limited use cases, e.g. ERC20 , ERC -721, ERC -1155. The platform would reequire a a zk development framework that allow users to easily write on-chain verifiers and generate off-chain proof for arbitrary programs. Final Remarks I believe in a world where future use cases will be many order of magnitudes of what we are seeing today. They will be mini-programs that govern interactions in gaming and in media. These programs will require on-chain operations that could only be satisfied with a horizontally scalable solution. On-chain data is always going to relatively expensive because their availability and validity have to be guaranteed by some kind of scarce resources. That scarce resource could be staking tokens, hash rate, or trust on central authority. Proof of stake is probably the best compromise, but cost of capital is going to pass on to the users. It could still be prohibitively expensive for some NFT use cases. The NFT blockchain should be able to differentiate that different behaviors has different security requirements. For example, updating the residence zip-code of a metaverse character should not be as expensive as trading an NFT . There will be dedicated NFT chains that allow the developers to control those parameters and adjust the fee structure to match their NFT use cases. Footnotes Ronin bridge is an example ↩","tags":"misc, , web3","url":"/2022-10-01-scaling-nft-metadata","loc":"/2022-10-01-scaling-nft-metadata"},{"title":"Notes on NFT Platforms","text":"A non-fungible token ( NFT ) represents ownership of a non-interchangeable unit of data. NFT is a powerful primitive because it enables natively digital ownership. Existing use cases of NFTs are still at the beginning of this evolution. The ERC721 standard has only a few functions that are programmed primarily on the ownership variable, allowing users to create, own, and trade their digital assets. Future NFTs will be programs of a more diverse and sophisticated set of state variables. For example, NFTs could represent a metaverse playing character. The player's history, status, and relationships could be represented by the NFT state and state transitions. The NFT 's program defines how the character could or could not interact with its environments. NFT platforms are the decentralized infrastructure where NFTs are minted, updated, and traded. Platforms are different from NFT marketplaces. Users buy and sell NFTs in marketplaces, and marketplaces are built on top of NFT platforms. Examples of NFT marketplaces are OpenSea, NBA Top Shot, or Autograph. The core of an NFT platform is a blockchain, the decentralized database where NFTs are defined and created. NFT platform could include more offerings, such as SDK , bridges, managed services, etc. Examples of NFT platforms are Ethereum, Solana, and ImmutableX, and Stargaze. In this post, I describe the NFT data model as a foundation to understand NFT platforms. I then compare and discuss the features of NFT platforms: blockchain type, programmability, scalability, media storage and delivery, and developer experience. NFT Data Model NFTs are simple, specialized state machines. A state machine has state and transition functions. An NFT has three types of data: an owner, metadata, and the unique media content. An NFT has an owner. Although it is not required, most NFTs are uniquely associated with some media content. An NFT also has a set of state transitions that define the capabilities of the NFT . Let's make these ideas concrete. Let's take ERC721 as an example. A typical ERC721 implementation contains the following on-chain data: token_id , owner address , name , and symbol . All the public and private functions only depend on the variable owner address . A common extension includes an off-chain metadata specification, where an NFT also contains a token_url . The url is a link to some off-chain bytes. These off-chain bytes are usually required to be a json file that contains key-value pairs such as description, image_url, etc. These key-value pairs are metadata. It should be noted that the on-chain data points name and symbol are also metadata. The third data type is the actual media content. It is referenced as a special off-chain metadata in the image_url key. Its value is a yet another link, pointing to the off-chain location where the media content is stored. The distinction of on-chain variables and off-chain metadata are important because off-chain data is not programmable. For example, an in-game NFT representing a property such that it is only transferable if it has a house built on it. This simple verification check could not be coded again the metadata has_house: bool 1 . Other than programmability, off-chain metadata introduces unnecessary dependencies. The data's availability is not guaranteed by the platform but by a third party service provider. For example, one popular solution is to pay for Pinata's pinning service to ensure that the media content is available through a IPFS query. Off-chain media content is not programmable. For example, someone wants to program an audio NFT such that the first 1000 listeners would get rewarded. The NFT 's on-chain state machine only contains a reference to its media. The media is stored and delivered by a third party service, and the service does not talk back to the state machine to inform who and how the media is used. Comparing NFT Platforms Future NFTs will be mini-programs that does not just operate on ownership data but also on arbitrary metadata and the usage behaviors of associated media content. For an example in gaming, monster cards could be combined to form a more powerful card. Those rules will be on-chain transition functions. For an example in the creator economy, a song represented by an NFT could monetize directly if the song becomes popular and garner millions of streams. I will discuss the key aspects of how to evaluate NFT platforms: blockchain type, programmability, scalability, media storage and delivery, and developer experience. Blockchain Type An NFT platform is a place where the NFT data model could be deployed. There are two common patterns to accomplish that. The first pattern is via smart contracts. The second pattern is to build application specific blockchains that natively encodes the NFT state machine. Any general purpose L1 and L2 chains could be used as a NFT platform. Examples are Ethereum, Solana, Flow, Polygon, Optimism, and Starknet. These blockchains do not natively understand the concept of NFTs. Developers implement the state machine that encapsulates the NFT data model. In practice, the community of a blockchain agrees on some standards that correspond to basic functionalities of NFTs. The second type of NFT platforms are decentralized protocols that are purpose-built for NFTs. The core state machine of a general purpose blockchain is a virtual machine. Instead, NFT specific chains implement the NFT data model directly. NFT behaviors are explicitly coded as state transition functions in the core protocol. ImmutableX is an example of a purpose-built state machine. Developers do not write the NFT data model in smart contracts. NFT functionalities are explicitly defined as core concepts by the protocol. Smart contract NFTs inherit blockchain characteristics of the host blockchain. The blockchain does the heavy lifting on making available a general purpose, deterministic state machine. An NFT is a small program that runs inside that virtual machine. ImmutableX is an application-specific, zero-knowledge (zk) rollup chain built on the Starkex platform and custom data availability solution. The chain sequence and process transactions in batches. The hash of the final state of each batch and a zk validity proof are posted to Ethereum. The ImmutableX platform relies on STARK validity proof to guarantee the validity of NFT state transitions. However, the protocol stores raw transactions in a custom network that is run by a data availability committee, which is equivalent to proof of authority. Programmability For example, a developer wants to program an NFT to represent a metaverse character. The character's skill development path should be governed by a set of rules. The character has to first attend woodworking training to gain certification. The certification unlocks her ability to enter quests. Future NFT use cases will expand greatly beyond the ERC721 standard. There are already many proposed standards on the Ethereum platforms, e.g. Permit, Micropayments, Subscription, Entangled Token, Multi Tokens, Re-Fungible Token, etc. There will also be custom use cases that won't be standardized. For example, a monster card game is encoded by NFTs. Cards could be combined. Cards could get updated based on previous battles. Each card's metadata is recorded on-chain. The cards' state transitions are different for each card game. For a second example, a developer wants to program an NFT to represent a sword that is interoperable in many games. There are specific rules that the sword could get upgraded. Special gemstones could be melted into the sword, or special monster kills could enhance a sword's characteristics. The rules are written as the state transition functions, and on-chain metadata are used to record the sword's current status. For a third example, a developer wants to program an NFT sword to be non-transferable once the sword gains a certain status. This is similar to soul bound token, with the exception that the NFT is soul bound to another in-game character, which could be another NFT . There will be countless such examples. There are also use cases that require the NFT program to operate on media usage data. For example, a developer wants to program an audio NFT such that the first 1000 listeners would get rewarded. For a second example, a developer wants to program a video NFT such that the NFT owner could claim a reward whenever the video has a 1000 incremental views. A creator works with a sponsor to create a video that embeds a short message from the sponsor, and the sponsor contributes to a reward. If the NFT transition functions only contain a reference to its media, the two previous examples cannot be programmed. When the media is stored and delivered by a third party service, the NFT state machine cannot learn who and how the NFT media is used. NFT platforms need to satisfy two requirements to allow for these future use cases. First, the platform allows a scripting environment to program user defined state transition functions. Second, the scripting environment has access to arbitrary on-chain metadata and metadata associated with media usage. Smart contract NFT platform trivially allows for on-chain metadata and scripting environments. The key challenge for smart contract NFTs is that on-chain metadata is expensive. This will be discussed more in the scalability subsection. Another challenge is that smart contract platforms do not have access to media usage data. This feature could be built separately to complement a smart contract platform. At a minimum, it requires coordination between the NFT state machine and the content delivery network. This will be discussed more in media storage and delivery subsection. ImmutableX does not allow user defined NFT behaviors. ImmutableX only supports basic ERC721 behaviors, so it does not need to allow for user defined programming. Furthermore, it depends on Starkex to generate and verify validity proof. Starkex only supports basic token operations that are defined in ERC -20, ERC -721, and ERC -1155. Starkex cannot verify state transitions of custom state machines, even those state machines were written in specialized Cairo programs. The choice of zk rollup makes adding user defined state transition more complicated. It would require their zk validity proof system to have the ability to interpret and validate user defined scripts because all the NFT level state transitions have to be aggregated into a single zk validation proof. Beyond scripting, ImmutableX has to add support for generic metadata. It is hard to envision that ImmutableX can add support for media storage and delivery. Stargaze is supposed to be a purpose-built application blockchain, but it is built as a blockchain with a general purpose virtual machine. The NFTs are implemented as smart contracts written in webassembly. The key difference is that developers are not allowed to deploy smart contracts directly. User defined smart contracts are posted on a forum and voted into deployment by a governance protocol. Scalability The discussion of blockchain scalability should be application specific. The most common metric of blockchain scalability focuses on the metric of transactions per second ( TPS ), which incorporates data from block time, block size, and transaction size. However, applications have different kinds of transactions. For example, some applications require a lot of transactions that create new key-value pairs on-chain. These transactions increase the state size incrementally. The blockchain has to deal with a storage bottleneck because transaction processors have to maintain the state in local storage. In a different example, payment applications mostly create transactions that are about fungible token transfers. The transactions are small, require low compute resources to process, and do not accumulate into a large state. TPS is a excellent metric for this type of application. For a third example, if transactions are computationally expensive, the bottleneck might be on processing power of validators. Different transaction types push a blockchain's boundaries in different ways. It could be bottlenecks of physical limitation in storage, CPU , or network. Or it could be bottlenecks that are imposed by design choices such as in gossip, consensus, and data availability mechanisms. Future use cases of NFT will increase the demand for on-chain metadata and user defined transactions that operate on those metadata. These use cases might push the limit on blockchain design choices. I will not go into details of the limitations of those design choices in this post. Instead, I will discuss the physical limitations that impact all blockchains regardless of how the implementationof mempool, gossip, consensus, and blockchain data model. Processing nodes might be able to avoid the bottleneck in network bandwidth or CPU resources. Network could reach 100MB /s, and it is possible there are hardware and software solutions that could allow a single machine to handle that data rate. Storage is a hard limit. NFT use cases will incrementally put more and more metadata on-chain. Programmability requires the state to be read and write accessible. The state grows indefinitely. If the accumulation rate is just 1 MB /s, the state grows to 30 TB in one year. The rest of the blockchain components might be able to handle transactions that summed to 100MB /s, but if even just 1% of those transactions are metadata updates, a non-sharded blockchain will be bottlenecked by storage. There are two common solutions to expand the state size beyond a single machine. One is state expiry and another is state sharding. The first strategy prunes the old data from the state. The strategy has to allow expired data to be recovered. The main drawback is dependence on third party services to hold the expired data. The second strategy is sharding. When the state grows beyond what could be handled in a single machine, the system uses more machines. This approach does not just scale storage, it increases throughput for network and CPU resources as well. As NFT usage of on-chain metadata increases overtime, blockchains will have to deal with the problem of large state. Ethereum approaches the problem with both a state expiry solution and sharding. Some blockchains are built as sharded chains, e.g. Near, Polkadot, and Sui. But some blockchains are not sharded, e.g. Solana, Polygon, ImmutableX. It is worth noting that a layer 2 chain such as ImmutableX could be sharded as well. They could shard their chain state and have different sequencing nodes for each shard. Each shard could still use the same non-sharded L1 layer to maintain state and verify validation proofs. There is no existing NFT platform that guarantees low cost, horizontally scalable on-chain metadata. The state of non-sharded blockchains will be limited to a single machine. It is also difficult to fully harness the scalability of sharded Even for smart contract NFTs on sharded blockchains, NFTs Sharded blockchains usually do not expose sharding control to smart contracts. For NFT smart contracts to fully leverage sharding, they have to adapt to the specific sharding architecture of the host chain so NFTs could interoperate across shards. A lot of the popular NFT smart contract platforms are blockchains that are optimized financial applications. On-chain data is too expensive for the purpose of NFT programmability. One solution is to build a sharded, NFT -specific blockchain. A blockchain of this kind could shard the chain according to the NFT data model and behavior. A lot of NFT behaviors are self-contained or limited to interact with closely related NFTs. These patterns are easily parallelizable. The chain should tailor the security requirement and cost structure for NFT use cases. Media Storage and Delivery Existing NFT platforms are designed to support collectible NFTs. Existing NFT platforms do not handle media storage and delivery. It is important to note that on-chain media storage is not feasible. The generally accepted practice is that the NFT contains a reference to the media. The media is stored and delivered through third party solutions that do not communicate with the state machines that manage the NFTs. A NFT platform that integrates the handling of media storage and delivery offers two major benefits. The first benefit is convenience and decentralization. NFT developers do not need to go to a centrally operated third party service provider to host the media. The second benefit is NFT programmability. When users interact with third party services to retrieve the media, the NFT platform cannot exercise any control over how that media is consumed or how to react to the usage behavior. This feature could be seen as an incentivized, peer-to-peer content network. The nodes participating in the delivery network interact directly with client applications. They fulfill media requests. These content nodes want to be paid for their work. The content nodes report back to a blockchain about their work and get rewarded there. This component could be built on top of smart contract NFT platforms as well. The blockchain network and the delivery network are connected via usage report transactions. The content nodes submit transactions to claim rewards. In theory, these user defined functions could be defined on any smart contract platform. The key challenge is that usage reports might be large transactions and require heavy use of on-chain metadata. I am not aware that anyone is working on such a solution as of the writing of this post. Developer Experiences Existing NFT platforms is not user friendly. Choosing a platform usually requires in-depth knowledge of decentralized infrastructure. Developers have to learn the best practices of smart contract systems because security is a must-have. Developers have to integrate with NFT marketplaces. They have to provide guidance to users on how to manage wallets, acquire cryptocurrencies, and navigate blockchain transactions. Developers might also need to develop purpose-built user interfaces so users could access and manage NFTs in contexts that are not already covered by existing marketplaces. ImmutableX alleviates one aspect of the challenges listed above. ImmutableX allows developers to interact with their chain through APIs calls. Developers do not need to write smart contracts to define NFT behaviors because all the available behaviors are predefined. The platform also provides users with a dedicated marketplace UI to interact with its NFTs. Developers should have the option to implement NFT behaviors with managed services. For example, a game developer could simply sign up for a service and acquire an API key to get started on implementing NFT functionalities. It could be as easy as integrating an SMS texting service using Twilio. The platform could also enable game users to interact with NFTs without requiring them to navigate a web3 wallet. Both the developers and users should always have the option to reclaim custody of their assets. Developers should always have the option to interact directly with the decentralized layers should they choose to go with that route. Footnotes It is possible to get around it with a complex setup of merkelizing the off-chain metadata state and the transition function takes in an additional argument in the form of a merkle proof. It makes a simple function very complicated. It also introduces the problem of data availability for the off-chain data because the function caller has to construct the merkle proof. ↩","tags":"misc, , web3","url":"/2022-09-01-nft-platform","loc":"/2022-09-01-nft-platform"},{"title":"Notes on Ethereum L2 Solutions","text":"I discuss three categories of L2 solutions: state channel, sidechain, and transaction aggregation. State channel is simple but limited. I do not think sidechain should be considered a L2 solution because it has to manage a security model of its own that is comparable in scope to running yet another L1 chain. Transaction aggregation has the most variety and is getting the most attention in research and development. State Channel State channel is the first L2 technique that is used to conduct secure peer to peer interaction. Payment channel is the easiest to understand. For example, two parties A and B setup a channel when both parties commit 100 dollars each. The setup is on chain. If A sends money to B, A signs off on an updated state \\((A=90, B=100, nonce=1)\\) . As long as B receives this state with signature, B would be sure that payment is finalized. Any payment could take this form. After the 1000th payment, the state could look like \\((A=20, B=180, nonce=1000)\\) . As long as both of them signed this state, and either party is sure that she had not signed anything with nonce greater or equal to 1000. They could cryptographically trust all the previous payments are correct. Whenever either party wants to close the channel, they would have to submit an onchain transaction. The idea of payment channel could be easily extended to state channel to allow arbitrary state transitions. The key advantage of state channel is instantaneous p2p interaction without any onchain execution. There are some limitations as well. First, state channel has a setup cost. Transactions could only happen for those who have already setup the channel. Second, state channel locks up fund. This locked funding also represents the transaction limit between the two parties. Third, participants have to keep track of the channel states. If they lose the latest and the counter-party refuses to provide a copy, the counter-party could use a previous state to close the chanel. Fourth, state channel is only limited to two parties. State channel could be extended to use a hub model. In the hub model, each user sets up a state channel with the same central X. Any transaction between A and B could conducted through X. This enables any pairwise transaction between participants that have a channel open with central operator X. This introduces a different issue, a centralization risk. If X goes offline, the entire system is down. X could censor certain transactions. Connext vector implements such a protocol. This technique is used as micropayment for the TheGraph. The hub model could be extended further still to a network model. Instead of one hub, there are many interconnected hubs. With the help of the network, a user could use the hubs to find a path to any other counter-party. This is the basic idea behind lighting network . Sidechain A sidechain functions effectively as an independent L1 chain. A sidechain runs in parallel of the mainchain. A sidechain has to operate its own consensus mechanism. If the sidechain's security is compromised, the mainchain is not impacted but also, the mainchain does not provide any guarantee to the users of the sidechain. The activities in the sidechain are independent of those happening on the mainchain until those activities are bridged back into the sidechain. For almost all intents and purposes, a sidechain behaves like an independent L1 chain with bridges. In general, I would not consider sidechains a layer 2 solution. However, Polygon's proof of state (PoS) chain is often mentioned as a layer 2 solution. In fact, many people consider Polygon to be one of the most successful L2 solutions. I disagree with that categorization. I put Polygon PoS in the category of a sidechain. Polygon PoS Polygon PoS is a blockchain built on top of the Tendermint consensus engine. Validators stake their MATIC tokens on Ethereum. The validators on the Polygon PoS Chain must stake their MATIC tokens and run a full node. The validator set is maintained in Ethereum smart contracts. A Merkle hash of polygon sidechain state is posted to Ethereum as a checkpoint. Assets are connected between Ethereum and Polygon through two way bridges. Polygon offers two bridges. The PoS bridge takes advantage of the validator set to confirm the transfers. The Plasma bridge uses a fraud proof mechanism to secure the transfers. It is important to note that the use of Plasms here only refers to the bridge. The EVM activities inside Polygon are not aggregated as Plasma checkpoint. The EVM activities are aggregated into a state hash checkpoint. If the PoS validators go rogue, users cannot challenge the Polygon transactions in L1. This PoS chain differentiates from other L1 chains, e.g. NEAR protocol or Avalanche, by maintaining its validator state and checkpointing its L2 state on Ethereum. These differentiations alow Polygon to call itself a committed sidechain instead of just another EVM compatible blockchain. Polygon maintains its validator set and their state on a smart contract. The Polygon validators periodically checkpoint the Merkle hash of the chain's state onto Ethereum. Instead of having these two system level state information on Ethereum, Polygon could just maintain them as part its L2 state by the validator set. Using smart contract to store these information has some values. First, the fork choice and rule of finality are backed by Ethereum. It eliminates the need to design and implement those mechanisms. L2 transactions are final as soon as those checkpoints are accepted by in L1. Second, there is data availability guarantee on its validator set and state checkpoints. However, these are only marginal benefits. The key security guarantee is the proof of state algorithm, the key data availability guarantee is on the set of validators keeping a copy of the all the historical transactions. Polygon as a committed sidechain is only marginally more secure than a hypothetical Polygon PoS that is set up as a completely independent L1 chain. Transaction Aggregation I use the phrase transaction aggregation as the category for many of the L2 solutions that are differentiated by their choices in two dimensions: the choice of verification mechanism and the choice of data availability ( DA ) solutions. The table below illustrates how to further categorize these different approaches. I won't be surprise if this table will get additional columns or rows in the near future. zk fraud proof onchain tx data zk rollup (Aztec, zksync, StarkNet) optimistic rollup (Optimism, Arbitrum) offchain tx data validium plasma ( OMG ) user choice volition (ImmutableX) n/a data availability committee (ImmutableX) n/a data availability L1 chains (Celestia, Polygon Avail) n/a Note that some categories have specific category names, e.g zk rollup, validium, and plasma. Some categories only have examples but no category name, e.g Celestia uses zk proof and another L1 blockchain for data availability. And yet, some categories do really have any working examples or category name; no one is working on those solutions. All of these L2 solutions process and aggregate individual transactions outside of the mainchain. Aggregation works as follow. Let say the starting L2 state is S0, this L2 state hash 1 is stored onchain in the mainchain. The L2 chain receives 1000 transactions. The L2 engine processes all of them and performs state transitions so the newest L2 state is S1000. The hash of S1000 is posted to the mainchain. There are two questions that the L2 protocol has to make clear: How do users know that the hash of state S1000 is valid? How do users reconstruct the state S1000 that corresponds to the hash? A L2 protocol has to provide a verification mechanism of the state transitions. The most popular mechanisms are zero knowledge proof and fraud proof challenge. Second, the L2 protocol has to allow the user to reconstruct state S1000 from state S0. For that to happen, the original transaction data have to be available to everyone 2 . The choice of verification mechanism is usually between zero knowledge proof and fraud proof challenge. I do not rule out that there are different techniques developed in the future. A zk proof cryptographically ensure state transitions are valid. A fraud proof challenge mechanism incentivizes users or watchers to validate the posted state transitions. There is a waiting period for the newly posted state to become finalized. Anyone could validate the posted state, and if they see something wrong, they post a fraud proof to revert the state and claim a reward. Zk proof has the advantage that posted state is finalized instantaneously. Zk proof could have additional privacy features as well. However, zk proofs requires heavy computations both in generation and verification. Zk proofs constructions are difficult to understand and audit, and the toolings and development experiences of zk framework are still a work in progress. See zk rollup discussion below for more details. The spectrum of choices in data availability ( DA ) exists because onchain data is expensive. How to make transaction data available is a tradeoff between data availability guarantee and cost. If the transaction data is not available to users, users cannot verify the state transitions and cannot access their accounts. By some estimate, calldata 3 costs about 16 gas per byte in Etheruem. Assuming a gas cost of 60 gwei and $3000/ ETH , 1kb of data costs close to $3 USD . At this price, both zk and optimistic rollup will have its limit both on overall throughput as well as high transaction costs. Even with compression, a zk rollup is still going to take up 10-15 bytes of data, so a rollup would cost $0.03 at a minimum. In practice, it is much high than that due to fix gas costs for each aggregation. The alternatives to onchain data gives rise to a variety of strategies for data availabilities. It should be noted that offchain storage was widely considered by the community before the recent rise in popularity of rollup, which use onchain solutions. There are also an emerging trend to use a third party and hybrid solutions for data availability. It should be noted that both zk and fraud proof are considered just as secure as L1. If data is also onchain, i.e. optimistic and zk rollup, it is generally accepted that they have the same level of security as L1. By offloading data availability away from the L1 chain, security is compromised to an extent. It does not mean it is not secure. One has to evaluate a data availability solution on its own. Optimistic Rollup (Optimism, Arbitrum) Rollup moves computation offchain, but all the original transactions data are kept on chain. For each set of transactions that a rollup protocol aggregates, a single summary transaction is submitted to the L1 chain. In this summary transaction, it contains the most up-to-date state of the rollup chain, and in the calldata field, it contains all the original transactions. They could be greatly compressed if the rollup is a special purpose protocol. If the protocol is general purpose, the compression level is less. Both Optimism and Arbitrum implement an EVM inside their rollup chains. Their rollup chains have a full full feature EVM . Optimism and Arbitrum have different implementation details. They implement a different virtual machine, even though both provide tooling to support solidity smart contracts. Their proof fraud verification mechanisms are different. Optimism is single round and Arbitrum multi-round fraud proof. See more details on other posts that discuss and compare both projects ( post1 , post2 ]. Zk Rollup (StarkNet, Zksync, Aztec) The development of zk rollup is behind optimistic rollup. The main reason is that zero knowledge computing platform and tooling are still early. It is still hard to generate zk proof for general purpose computations; hence it is hard to make full feature EVM zk rollup chain. The hope that is technology will improve overtime, and zero knowledge platform will become sufficient mature and general purpose soon. Many people believe that zk rollup would eventually be the better technology of the two flavors of rollups. Zk rollup has some advantages over optimistic flavor. Zk rollup has fast withdrawal period. Zk rollup allow for higher level of data compression because verification data could be collapsed into the zk proof. StarkNet is a zk rollup chain deployed on Ethereum. It is the flagship project of the StarkWare, a company that also develops zero knowledge friendly language and platform Cairo that aims to be the tooling for generating validity proofs for general computation. StarkNet does not natively support EVM yet, but it supports smart contracts that are written in Cairo. It is not hard to see that Solidity could be transpiled into high level Cairo or compiled into Cairo compatible byte code. StarkNet allows L1 and L2 smart contracts to interact. The interaction is a form of message passing. It takes advantage of the fast finality of zk rollup. Unlike optimistic rollup, as soon as the L2 layer posts the aggregated state transition to L1, those state transitions are considered final. The messages calls could be posted in the same L1 transaction as the L2 state hash and zk proof update. There are two other zk rollup solutions. Zksync is already on mainnet. Aztec is only on testnet. Similar to StarkNet, they are not offering a general purpose EVM yet. Both of them only support ETH and ERC20 token transfer so far. I have not spent the time to understand them in details. Plasma Plasma was once touted as the most likely candidate to scale Etheruem. Plasma was first proposed in this paper . There are a few additional iterations: MinimumViable Plasma ( MVP ) plasma, Plasma Cash , and then MoreViable plasma. Plasma never took off for a few reasons 4 . First, Plasma stores data offchain, it needs a solution for data availability. The proposed solutions could be a centralized operator, PoS, or PoA system. Because the operator(s) have control over data, most proposed implementation also give them sole control to be able to submit aggregation to the chain. The data availability layer and aggregation layer make a Plasma chain to behave effectively as a sidechain. The security of the overall system is as secure as the PoS or the centralized operator layer. Second, Plasma has a rather complex exit game. This problem is compounded because transaction data is only available to the validator set or the centralized operator in most implementations. Lastly, other techniques for data availability become popular. Teams that were working on Plasma solutions have one way or the other pivoted to working on a variant that uses similar fraud proof but a different data availability strategy. Plasma is not implemented for good reasons. When it was first proposed, the important of data availability was somewhat obscured. The blockchain community was mostly focused on developing consensus algorithm to ensure correctness of computation. It is much more clear now that we could prove correctness by cryptographic proof or a challenge mechanism. The harder part is to ensure that everyone could have access the data to verify state transition or generate fraud proof. In many ways, the issues encountered in building a plasma system naturally gives rise to the variety of data availability solutions today. Data Availability L1 Chains Celestium is an L2 solution that uses Celestia for data availability. Celetia is an independent data availability blockchain. Celestia itself is a proof of stake chain. It uses data sampling technique to ensure data availability. It has not launched its mainnet yet. Celestium has not publicly stated whether it will use zk or fraud proof as their verification mechanism. Polygon is also working on a general purpose data availability layer, known as Polygon Avail . It is based on Cosmos SDK , data sampling technique, and KZG polynomial commitment. The development of Celestia and Poly Avail aim to fill a void. Of the two dimensions in the categorization table above, the dimension of verification scheme is somewhat well settled. The community know that fraud proof mechanism is a matter of implementation, and the community is improving on the zero knowledge proof ecosystem. Once the zk proof stack is mature, the community expect zk rollup will become the go-to verification scheme. For the other dimension of data availability, onchain is most the easiest and the easiest solution, but it is also the most expensive. Ethereum is aiming to fix it, thus becoming an effective data availability chain. Polygon Avail and Celestia are attempting to accomplish the same thing. Volition Volition follows a simple idea that user could choose where the transaction data go. The user pays for the corresponding level of data availability. Onchain data costs the most, and offchain data costs nothing because there is no guarantee that the transaction data is recoverable. The existence of this category demonstrates that data availability is a tradeoff between cost and availability guarantee. ImmutableX allows user to choose whether transaction data to go onchain or to be validated by a data availability committee. ImmutableX ImmutableX is not a generic EVM platform. The protocol is designed specifically for NFTs. The protocol makes it easy to transfer, trade, and mint NFTs. ImmutableX aggregate ImmutableX protocol (L2) transactions and generate a zk proof. Its proof generation engine is developed in partnership with StarkWare, using StarkEx prover and verifier. ImmutableX uses a hybrid DA solution. It allows users to choose post the transaction data onchain. It also allows user to use a offchain mode. The offchain mode uses the strategy of Data Availability Committee ( DAC ). All members of the committee have to sign off that they have received a copy of the data. It is a 1 of n security model, in that only one of the committee members need to be an honest participant for data to stay available. References A blog post about data availability landscape A reddit thread on shared security Vitalik's writeup on rollup Footnotes One example is a Merkle Hash. But it could be a generic polynomial or even inner product commitment as well. See polynomial commitment . ↩ There is one key difference between the zk and fraud proof approach. If there are verification only information contained in the original transaction, e.g. signatures, those information could be ignored. A ZK proof on state transitions already retains the cryptographic evidences that all the state transitions are valid. The only data needed to be made available is the data that allow for reconstruction of state transitions from S0 to S1000. If verification is through fraud proof, all the original transaction data have to be available. ↩ Note that calldata is not stored in EVM memory. This data is not accessible by smart contracts. It is already much cheaper than bytes stored in EVM memory. ↩ See post1 and post2 for more details ↩","tags":"misc, , web3","url":"/2021-03-22-ethereum-l2","loc":"/2021-03-22-ethereum-l2"},{"title":"Quote of the Month","text":"If it is easy, everyone would be doing it. It is rewarding only if it is difficult.","tags":"misc, , quote","url":"/2021-03-17-quote-of-the-month","loc":"/2021-03-17-quote-of-the-month"},{"title":"Notes on ERC20 Stablecoins","text":"There are two major categories of ERC20 stablecoins: centralized or decentralized. A centralized stablecoin is issued by a single authority. The authority could mint any supply of the coins. A centralized coin holder has to trust the central authority for the pegged value to hold. A decentralized stablecoin implements a mechanism for minting, burning, and market operations to maintain a desired trading value. The overall crypto market swing wildly. It is easy to maintain a peg when the market is under price fluctuations of 10-30%. However, it is not unreasonable to assume that the overall market could contract 95%. One of the primary uses of stablecoins is have a mechanism to protect one's portfolio against such an event. If a stablecoin cannot reasonably survive such a contraction, it is not a viable stablecoin. One has to use a different strategy in case of market crashes, which is a probable event within the horizon of a few years at this junction of the crypto ecosystem. As of the writing of this blog, the circulating volume of centralized stablecoins is about 10x of that of decentralized stablecoins. The volume of the top three centralized stablecoins (Tether, USDC , and Binance USD ) is about $150 billion USD . The top 3 decentralized stablecoins (TerraUSD, DAI , and Fei) is about $25 billion USD . Centralized The most important thing to evaluate a centralized stablecoin is to evaluate the trustworthiness of the issuing organization. This is similar to how fiat currency holders evaluate a currency and its issuing nation state central bank. The key difference is that a central bank is backed by a monopoly of violence, and the authority of a stablecoin minter is the combination of its trust in the community, fund raising history, and legal obligation to its stated intents. That being said, I would probably trust the stability of a properly audited, sufficiently collateralized, and US -based 1 issuing company over more than 95% of the central banks authorities in the world. In the other words, holding a properly issued centralized stablecoin is probably safer than holding but a handful of real world fiat currencies. USDC The issuing organization is founded by Circle and Coinbase. To the best of my knowledge, both of these companies have the power to issue USDC . The guarantee is that for each USDC coin issued, there is a corresponding amount of fiat USD deposit in a regulated bank. I would suppose that Circle and Coinbase do not keep a fixed 1 to 1 ratio. Their business model takes advantage of fractional reserve to invest the collateral, similar to how retail banks makes money from retail deposits. Coinbase is a large public company. It has every incentive to comply with all of their stated intentions and how they are going to have sufficient funds to guarantee USDC withdraws. Circle is a well funded startup, and I would have to imagine that they have the same objective. I would feel comfortable holding a large chunk of my investment in USDC and believe that stablecoin value could be maintained under severe market turmoils. USDT (Tether USD ) Tether had been the most circulated stablecoin for a long time. It is also the most controversial. One has to do its own research, e.g. link1 , link2 , link3 , to make up their mind on how much to trust this stablecoin. I am comfortable accepting tethers in my defi transaction, even in large amount. However, I would not hold tethers long term. TUSD (TrueUSD) TUSD is issued by the company TrustToken. The company contracts a third party legal law firm to independently verify its digital assets to ensure that the issued TUSD could be claimed 1-for-1 in US dollars. It is similar to USDC . The key difference is how much I would trust TrustToken vs. Circle/Coinbase. I trust USDC more than TUSD . I trust TUSD more than I do Tether. I would not use TUSD as my primary stablecoin. BUSD (Binance USD ) Binance is an interesting company. It is the biggest crypto exchange in the world. It also has questionable legal structures. It is operating somewhat in the gray area in the United States. Because of its market dominance, it gets a lot of usage in many parts of the world for many of its product offerings. Stablecoin is one of its popular products, and its centralized smart contract, EVM chain is wildly popular. However, I consider Binance a temporary phenomenon because it primarily takes a centralized approach to the decentralized ecosystem. I am concerned about its legal status in the US . I have seen similar oversea companies that had to shut down their US operations with just a single regulation change 2 . I would not hold any significant amount in BUSD . Decentralized, Algorithmic Stablecoins Decentralized stablecoins rely on at least some decentralized algorithms to maintain their peg value. There are different ways to categorize these coins based on the implemented stabilizing mechanisms, e.g. algorithmic vs. asset-back, over-collateralized vs. fractional reserve, etc. I will not attempt further categorizations here. Some of the mechanisms might even be administered by a centralized operator, but overall, the aim is to keep the key operations as transparent and as automated as possible. The single most important criteria to determine the quality of any stablecoin is how much faith do I have on its pegged value under extreme market conditions. The only true test is that a stablecoin were to survive such an event. Unfortunately, the market swings in the past 3 years were not sufficient to convince me that any of the stablecoins discussed below would survive an extreme market contraction. I would not feel comfortable holding all of my crypto funds in any of these decentralized stablecoins and believe that they will hold the USD peg when the overall market crashes similar how ETH crashed in 2018. ETH went from $1200 to $100. DAI DAI was the first decentralized stablecoin. It seems that it takes a rather conservative approach. Every DAI that is minted, it needs to over-collateralized. For example, for every $100 of newly minted DAI , the protocol requires $150 worth of Eth deposited. The over-collateralized requirement might appear to provide a high guarantee on the underlying value of DAI . That is only true if the requirement to be 50% over is a sufficient buffer. However, ETH losing 50% of its value should not be considered a black swan event. We have seen ETH losing more than 90% of its value. The high collateral limit would not be sufficient. DAI is minted when a user creates a VAULT and deposits cryptos. Every DAI in circulation is directly backed by excess collateral. The protocol uses an oracle, i.e. Chainlink, to make sure that each vault has sufficient collateral. When the collateral falls under a threshold, the protocol sells the cryptos to cover its DAI position. If there are excess DAIs from the liquidation process, DAI is refunded to the vault holder. If liquidation is not sufficient, the protocol incurs a protocol debt. The protocol maintains a maker buffer. This buffer could be used to cover the protocol debt. This buffer could be increased by a debt auction. The protocol mints MKR and sells those to increase the buffer. Minting MKR is inflationary, and all MKR tokens holders lose equally. If the buffer gets too large, the protocol would try to buy back more MKR and burn them. This is a deflationary action and all MKR holders gain equally. MKR is the DAO token of the protocol. DAO could vote to update the protocol. DAO determines what cryptocurrencies are accepted as collateral. DAI is not capital efficient. The capital that are committed to mint DAI is locked up. Worst, the collateral level is above 100%. Centralized stablecoins could use fractional reserve. Even the high over-collateral level might not be enough if market falls precipitously. If Eth goes down 70%, almost all vaults have to be liquidated and still incurs protocol debt. MKR price could go down concurrently with ETH even before the all the liquidation take place. The remaining market capitalization of MKR might not sufficient to cover the protocol debt. At that point, there is no amount of minting of MKR is going to be able to cover the rest of the open DAI positions. DAI price would have to go below the peg. The best thing about DAI is that it is highly decentralized and the algorithm is simple. As long as the number of circulated DAI is not large relative to the market capitalization of MKR , I would not mind conducting most of my defi transactions in DAI . However, I would not uses DAI as my cash reserve when I am expecting market turmoil. DAI was an amazing innovation. It took the first step toward a viable decentralized stablecoin. But at the same time, it is an unsatisfactory long term answer. Fei FEI aims to be more capital efficient than DAI . FEI is minted when someone buys FEI against a bonding curve set by the protocol. The capital accrued by the protocol is used to conduct market operations to maintain the peg. Initially, the only bonding curve is denominated in ETH . All of the ETH is put into a uniswap liquidity pool. The amount of total liquidity roughly matches the total FEI in circulation. The stabilizing mechanism is mostly automatic. The incentive to push down the FEI price is simple to understand. If FEI is selling above $1, say $1.05. Someone could mint the token at $1 from the bonding curve, and then sells the token at normal exchanges for a $0.05 profit. When FEI is selling below the peg, the protocol controller has to buy FEI . I believe that this part is not fully automatic in the smart contracts yet and requires manual intervention. When the peg is below target for an extended period of time, the controller took out the uniswap pool and use the fund to buy FEI until FEI is back to $1. The remaining money will be put back into the uniswap liquidity pool. Hence, when FEI is selling at $0.95, a person has incentive to buy from an exchange (e.g. uniswap) because if the prices does not get back to $1, the protocol controller will buy it at $1. Oftentimes, the belief that someone will buy it at $1 is sufficient for the peg to be remained at $1. The person profits $0.05 one way or the other. FEI 's market mechanisms are clever. It allows for the protocol to bootstrap itself. It allows for redeemability at all times. It does not require capital lockup from the user's perspective even though it effectively creates a fractional reserve from the bonding curve. The incentive mechanisms are simple and direct. FEI holders should have a reasonable belief that the peg value holds under most conditions. The key question is always the same: will the peg survive a 95% market contraction. I am going to say no. But unlike DAI , I could see FEI surviving the initial shock. If the contraction continues, I would predict that FEI will lose values once most stablecoin holders want to exit the crypto market all together. During the initial shock, the protocol controlled value is less than 5% of the value of the circulated FEI . If there is a run at the bank, the protocol controlled value evaporated quickly and FEI falls to zero. However, during market turmoil, most stablecoin holders are not in a hurry to convert FEI to more ETH . In fact, there might be more ETH holders who are looking to buy more FEI , therefore increasing the protocol controlled value in relative terms. If the bear market continues and FEI holders want to leave the crypto ecosystem, it will become an effective run at the bank, albeit at a slow pace. If the volume of sell outpaces buy even by a little, the FEI market would collapses because of the large value of outstanding FEI . UST (TerraUSD) TerraUSD ( UST ) is a stablecoin on the terra blockchain. Terra is a layer 1 blockchain that focuses on implementing stablecoin mechanisms. TerraUSD ( UST ) is the stablecoin that is pegged to USD . Terra chain supports stablecoins that are pegged to other fiat currencies or currency basket as well. LUNA is the native token of the terra chain. LUNA is the staking token used to secure the chain. Stakers validate transactions and earn rewards from transaction fees. The UST could be bridged to Etheruem as a ERC20 token. UST 's peg value is obtained by balancing the supplies of LUNA and UST . Its incentive mechanism crucially depends on the assumption that UST price could be sufficiently influenced based on expanding and contracting the total circulated UST . When the price of UST is above $1, the protocol incentives users to burn UST to mint LUNA . When the price of UST is below $1, the protocol incentives users to buy UST with LUNA . Similar to DAI and FEI , the mechanisms should work sufficiently well under normal market conditions. When LUNA and the overall crypto market crashes, the incentives might not be strong enough for UST to maintain the desired pegged value. LUNA could lose so much value that the outstanding value is only a small fraction, say 5%, of the total outstanding UST . At that point, even 90% of LUNA being burned to buy UST might not be sufficient to counter the downward price pressure. I suspect that its survivability is similar to FEI . The primary difference would be on the strength of the based assets: LUNA vs. ETH . If both of those assets fall equally in percentage, Terra and FEI 's market mechanisms probably perform similarly. While it is not a guarantee that the peg is not maintained, it is also not a guarantee that the peg is maintained, neither. It all depends on if there is a genuine panic sell on UST . The panic could be entirely avoided only if there is a guarantee that even in the event of a panic, peg value holds. That is one of the key insights of modern microeconomic theories. Final Thoughts I like decentralized stablecoins. I primarily use decentralized stablecoins in defi transactions. I want to use decentralized stablecoins only. However, I do not have a firm belief that any of the existing decentralized stablecoins could survive extreme market movements. Cryptocurrencies are known to be volatile, and I have seen huge crashes in the recent past. The belief that a decentralized stablecoin might not maintain value under a panic sell is the reason why a panic sell will happen. As much as I dislike centralized solutions, I expect a well collateralized centralized solution with a diversified asset portfolio to hold its peg even when the issuing company's stocks crashes along with the overall crypto market. It would at least hold for as much as the underlying assets, which correlates with the asset prices outside of the crypto market. The biggest risk with centralized stablecoins is regulatory. The U.S. government has an outsized and unilateral power to completely freezes assets of the issuing authority, which would effectively kill the corresponding stablecoin. However, that risk is minimum compared to other market risks as of now. If my holdings were already in an centralized exchange (e.g. Coinbase), I would be indifferent about holding USD or USDC when I sell my portfolio in anticipation of a crash or in the middle of crash. If my holding were to be in decentralized wallets, I would only mostly hold centralized stablecoins and only hold small amount of decentralized stablecoins in a stressed market. Footnotes The U.S. effectively has a monopoly in the world's financial systems. If a company does not have serious operations in the US and its issued stablecoin exceeds in the 10s of billions of dollars, the US congress or even a simple executive order could completely freeze that company's financial assets even when those assets are not held by US institutions. Financial institutions around the world cannot defy US rules and regulations in the slightest, or they themselves could be blacklisted. ↩ Pokerstar was one such example. ↩","tags":"misc, , web3","url":"/2022-03-15-stablecoins","loc":"/2022-03-15-stablecoins"},{"title":"An Example of Polynomial Commitment","text":"A vector commitment is scheme that allows one to commit to a sequence of values while keeping them hidden. The committer is also able to reveal the values later by providing a proof. Merkle tree is a form of vector commitment. The sequence of values are the leave nodes. The commitment is the root hash. The revealing strategy is by way of a Merkle proof. One of the limitation of a Merkle proof is the proof size is \\(O(k \\log_k n)\\) . A polynomial commitment is a strictly more expressive scheme than a vector commitment. A polynomial scheme could be used as a vector commitment. Let's make the setup more concrete. We have a sequence of values that we want to commit, \\(\\{x_i\\}_{i=0}^n\\) . We want a construction where we have a constant size \\(C\\) to represent these values, and we also want a constant size \\(\\pi\\) as a proof to the authenticity of a value \\(y=x_i\\) . The following scheme is the KZG polynomial commitment. Preliminary The construction depends on pairing based cryptography primitives. I am not going into the technical details. You can find some a simple introduction to one such construction in my BLS post . The basics is that there are three groups, \\(\\mathbb{G}_1\\) , \\(\\mathbb{G}_2\\) , and \\(\\mathbb{G}_T\\) constructed on the prime field \\(\\mathcal{F}_r\\) . \\(r\\) is the prime field order. Let \\(G_1\\) and \\(G_2\\) be the group generator of \\(\\mathbb{G}_1\\) and \\(\\mathbb{G}_2\\) respectively. There is a bilinear map \\(e: \\mathbb{G}_1 \\times \\mathbb{G}_2 \\mapsto \\mathbb{G}_T\\) . There are also arbitrary many of these values known by everyone, \\(t^k G_1\\) and \\(t^k G_2\\) , where \\(k= 1, 2, ...\\) The trapdoor value \\(t\\in \\mathcal{F}_r\\) is discarded and not known by anyone. We just need as many of these parameters as the size of the sequence we want to commit to. Commitment Create a polynomial from the value sequence \\(\\{x_i\\}\\) . This is straight forward, and could be accomplished by lagrange polynomial . The idea is simple. The polynomial could be just set to be as high degree as possible so all the values could be encoded in the polynomial. The polynomial \\(p\\) is such that \\(p(i) = x_i\\) . We write this polynomial in terms of its coefficients, \\begin{equation} p(z) = \\sum_{k=0}^{n-1} p_k z^k \\end{equation} The commitment is \\begin{equation} C = \\sum_{i=0}^{n-1} p_i t^k G_1 \\end{equation} Note that the commitment \\(C\\) is a curve point in \\(\\mathbb{G}_1\\) . It is a curve point regardless of the size of the original sequence. Proof We want to construct a proof of a single value, \\(y=x_i\\) . Because we convert the value sequence into a polynomial \\(p\\) , we want to prove \\(p(i) = y\\) . Let's define another polynomial, \\begin{equation} q(z) = \\frac{p(z) - y}{z - i} \\end{equation} Again, we re-write this polynomial in its coefficient form, where \\begin{equation} q(z) = \\sum_{k=0}^{n-1} q_k z^k. \\end{equation} The proof is \\begin{equation} \\pi = \\sum_{i=0}^{n-1} q_i t^k G_1 \\end{equation} Note that the proof \\(\\pi\\) is a curve point in \\(\\mathbb{G}_1\\) . Verification We only need to verify the following relation. \\begin{equation} e(\\pi, tG_2 - iG_2) = e(C-yG_1, G_2) \\end{equation} The commitment is \\(C\\) . The claim is \\(y = x_i\\) . The proof is \\(\\pi\\) . All of these are public information. \\(tG_2\\) are public as well. Notes The single point proof could be extended to a multiproof. The proof size remains to be a single curve point, hence constant size. We could use the polynomial commitment scheme instead of the Merkle tree vector commitment scheme to represent a trie. The proof of a k-ary Merkle tree requires each path to contain \\(k-1\\) many hash values, the number of hash values of the interested node's siblings. The proof size is \\(k \\log_k n\\) . However, the polynomial commitment proof would allow us to use a single curve point to represent the proof at each level. It reduce the needs to list out sibling's hash values. Hence the overall proof size is \\( \\log_k n\\) . This could actually be one better. Because the revelation proofs are curve points, they are additive. We could use a single curve point to prove all parent-child relationship for each of the level. The overall proof size could be reduce to constant size. References: Vitalik's description of Verkle tree Dankrad Feist's description of Verkle trie Polynomial and Ethereum state root","tags":"misc, , crytography","url":"/2021-10-11-basic-polynomial-commitment","loc":"/2021-10-11-basic-polynomial-commitment"},{"title":"BLS -12-381 Basics","text":"This post lays out the bare minimal basics required to understand BLS signatures. Most of the hard work is about setting up cryptographic primitives. To make things concrete, I only use the details from the BLS -12-381 curve. The post on vanilla elliptical curve cryptography contains some basics on how EC groups are defined. Preliminary Let's define the following constants. k = 12 r = 0x73eda753299d7d483339d80809a1d80553bda402fffe5bfeffffffff00000001 q = 0x1a0111ea397fe69a4b1ba7b6434bacd764774b84f38512bf6730d2a0f6b0f6241eabfffeb153ffffb9feffffffffaaab u = -0xd201000000010000 Note that \\(r = u^4 - u^2 + 1\\) is a 255 bit prime, and \\(q = \\frac{1}{3} (u-1)^2(u^4-u^2+1) + u\\) is a 381 bit prime. We also define field extensions. See the Implementing Pairing paper for more details. \\begin{eqnarray*} \\mathcal{F}_{q^2} &=& \\mathcal{F}_q[u]/(u^2 − \\beta), \\beta \\in \\mathcal{F_q} \\\\ \\mathcal{F}_{q^6} &=& \\mathcal{F}_{q^2}[v]/(v^3 −ξ), \\xi \\in \\mathcal{F}_{q^2} \\\\ \\mathcal{F}_{q^{12}} &=& \\mathcal{F}_{q^6}[w]/(w^2 −\\gamma), \\gamma \\in \\mathcal{F}_{q^6}. \\end{eqnarray*} First Group The first group is defined by the curve equation on the prime field \\(\\mathcal{F}_q\\) : \\begin{equation} \\mathbb{G_1} = \\{ x, y \\in \\mathcal{F}_q: y^2 = x^3 + 4\\} \\end{equation} \\(\\mathbb{G_1}\\) has a cofactor of \\(h_1 = (u -1)^2 / 3\\) , which is a 32 bit integer. A point in this group is a value with two coordinates, each is a 381 bit integer. The point could be encoded in compressed form with only 48 bytes. Only one coordinate needs to be provided. Let \\(g_1 = (x, y)\\) be this group's generator. x : 0x21a6 d67ef250 191 fadba 34 a0a301 60 b9ac92 64 b6f95f 63 b3edbe c3cf4b2e 689 db1bb b4e69a41 6 a0b1e79 239 c0372 e5cd7011 3 c98d91f 36 b6980d y : 0x0118 ea0460f7 f7abb82b 33676 a74 32 a490ee da842ccc fa7d788c 65965042 6 e6af77d f11b8ae4 0 eb80f47 5432 c666 00622 eca a8a5734d 36 fb03de Second Group The second group is defined by the following curve equation on the quadratic field \\(\\mathcal{F}_{q^2}\\) . \\begin{equation} \\mathbb{G_2} = \\{ x, y \\in \\mathcal{F}_{q^2}: y^2 = x^3 + 4(i + 1)\\} \\end{equation} \\(\\mathbb{G_2}\\) has a cofactor of \\(h_2 = (u^8 - 4u^7 + 5u^6 - 4u^4 + 6u^3 - 4u^2 - 4u + 13) / 9\\) , which is a 127 bit integer. A point in this group is two values in \\(\\mathcal{F_{q^2}}\\) . It is 96 bytes in compressed form. Let \\(g_2 = (x, y)\\) be this group's generator. Note that \\(x, y \\in \\mathcal{F}_{q^2}\\) . To be more explicit, \\(x_0, x_1, y_0, y_1 \\in \\mathcal{F}_q\\) , and \\begin{eqnarray*} x &=& x_0 + x_1 i \\\\ y &=& y_0 + y_1 i \\end{eqnarray*} x_0 : 0x24aa2b2 f08f0a91 26080527 2 dc51051 c6e47ad4 fa403b02 b4510b64 7 ae3d177 0 bac0326 a805bbef d48056c8 c121bdb8 x_1 : 0x13e02b60 52719 f60 7 dacd3a0 88274 f65 596 bd0d0 9920 b61a b5da61bb dc7f5049 334 cf112 13945 d57 e5ac7d05 5 d042b7e y_0 : 0xce5d527 727 d6e11 8 cc9cdc6 da2e351a adfd9baa 8 cbdd3a7 6 d429a69 5160 d12c 923 ac9cc 3 baca289 e1935486 08 b82801 y_1 : 0x606c4a0 2 ea734cc 32 acd2b0 2 bc28b99 cb3e287e 85 a763af 267492 ab 572 e99ab 3 f370d27 5 cec1da1 aaa9075f f05f79be Third Group and Bilinear Map The third group is \\(G_T\\) , the \\(r\\) -th roots of unity in \\(\\mathcal{F}_{q^{12}}\\) . It is a subgroup of \\(\\mathcal{F}_{q^{12}}\\) , with an subgroup order \\(r\\) . The bilinear map \\(e : \\mathbb{G_1} \\times \\mathbb{G_2} \\mapsto \\mathbb{G_T}\\) is defined as the following. For any \\(P\\in \\mathbb{G_1}\\) and \\(Q \\in \\mathbb{G}_2\\) , \\begin{equation} e(P, Q) = [f_{r, P}(Q) ] ^{(q^k - 1)/r} \\end{equation} Note that \\(r\\) is the group order of \\(\\mathbb{G}_T\\) , \\(k\\) is the embedding number \\(12\\) , \\(q\\) is the 381 bit prime. \\(f_{r, P}\\) is a polynomial, the miller function. Miller function could be computed by Miller's algorithm, which looks like a double-and-add technique. These parameter setup could be found in the IETF specs . Signing We could use either \\(\\mathbb{G}_1\\) or \\(\\mathbb{G}_2\\) as the public key space. To keep things simple, we use \\(\\mathbb{G}_1\\) as the public key space. Public keys are 48 bytes. A secret key is any number \\(\\rho \\in \\{1, 2, ... , r -1\\}\\) . The associated public key is \\(\\rho g_1 \\in \\mathbb{G}_1\\) . Let the message be \\(m\\) . Assume that we have a hash function, \\(H: M \\mapsto \\mathbb{G}_2\\) , where \\(M\\) is a the message space. The signature is calcuculated as \\begin{equation} s = \\rho H(m) \\in \\mathbb{G}_2. \\end{equation} Note that the signature is a point in the group \\(\\mathbb{G}_2\\) . It is 96 bytes in length. Verification Verify the following relationship is sufficient, \\begin{equation} e(\\rho g_1, H(m)) = e(g_1, s). \\end{equation} Note that \\(\\rho g_1\\) is the public key of the signer. Aggregation Let say that the message is signed by many public keys, \\(\\rho_1 g_1, \\rho_2 g_1, ..., \\rho_n g_1\\) . The signatures are \\(s_1, s_2, ..., s_n\\) . The aggregate public key is then \\begin{equation} \\rho_A g_1 = \\rho_1 g_1 + \\rho_2 g_1 + ... + \\rho_n g_1. \\end{equation} The aggregate signature is \\begin{equation} S_A = s_1 + s_2 + ... + s_n \\end{equation} The verification critieria is more or less the same. \\begin{equation} e(\\rho_A g_1, H(m)) = e(g_1, s_A) \\end{equation} Notes Using the BLS -12-381 curve is straight forward once all parameters are well setup. The name of BLS -12-381 comes from \\(k=12\\) , which is the embedded degree. One way to interpret it is that \\(12\\) is chosen so that \\(\\mathcal{F}_{q^{k}}\\) contains all the \\(r\\) -th root of unity. \\(381\\) is the bit length of of the prime number \\(q\\) used to define the finite field, \\(\\mathcal{F}_q\\) . The role of \\(\\mathbb{G}_1\\) and \\(\\mathbb{G}_2\\) could be swapped. One signing party could tamper with the aggregate public key. It is sufficient to ask each public key holder to shows that they also has possession of the corresponding private key. For the sake of simplicity, I ignore most of the discussions about security. I will add those discussion in a later post if there are interests. References A gentle but verbose introduction to BLS12 -381: IETF specs Ethereum BLS Spec Miller algorithm How to compute pairing","tags":"misc, , crytography","url":"/2021-09-25-bls-12-381-basics","loc":"/2021-09-25-bls-12-381-basics"},{"title":"Notes on Decentralized Storage Networks","text":"I am posting some notes overviewing decentralized storage networks ( DSN ). Before I go into the details of specific networks, let's explore some general features of DSN . Traditional blockchain networks could act as storage networks. Public ledger usually impose additional structures on how bytes are structured in the forms of blocks and transactions. Transactions are the basic unit that have to satisfy the blockchain unique transition function from one transaction to the next. Most blockchains allow users to put in arbitrary bytes as a special data field in transactions. However, these blockchains should not be considered storage networks because of limited data throughput. I had previously written about blockchain throughput . Without compromising the risk of centralization, a reasonable throughput of blockchain is about 10 KB /s per shard. This space is very limited and expensive. It would require millions, if not billions, of dollars to store a movie on one of these blockchains, even if the chain is sharded. The fundamental reason that a public ledgers should not evolve to become a storage network is because the security level required by a public ledger is far more expensive than what is necessary for storage. The data within a public ledger has to be validated by a securing resource. That securing resource could be proof of work, proof of stake, proof of authority, or any other scarce resource. In some ways, the validity of the data has to be voted on by some limited resources; otherwise, if an arbitrary amount of voting power is manufactured out of the thin air, the public ledger is not secured. (I will write a post on theoretical fundamentals blockchain security.) Furthermore, the ledger's data has to be also massively available. Data availability is inherently expensive. Replicating data requires physical resources in the form of hardwares, bandwidth, real estate, and electricity. If a piece of data is replicated 5 times, it costs roughly 5x in physical resources. Public ledgers requires data to extremely resilient to failures. Bitcoin and Ethereum chain states are replicated tens of thousands times, if not hundreds of thousands or even millions times. Even for a sharded chain, it will have 100s of replication at a minimum. This replication level would be impossibly expensive for petabytes of data, let alone exabytes or more. It is worth noting that AWS S3 charges $500,00 - $2,000,000 a year for 10 petabyte of data. It is safe to assume that data there has 3-10 replicas. AWS could be operating with a great margin, but it is hard to see that AWS could run a profit if it is replicating data 100s over. Decentralized storage network should allow users to choose their own replication level. Database networks could be loosely categorized into three typologies. In roundabout ways, I have already discussed two of those. The first topology type is single machine DB , where the DB is replicated in every full node of the network. The current version of Bitcoin and Ethereum is of this variety. The second type is a sharded db, with each shard replicated by a group of nodes. Polkadot and Eth2 are examples of this type. The third network type expect heterogeneous storage nodes. Each node store a unique set of data segments. One would find this network typology for almost all large scaled database deployment. S3, Hadoop file system, Cassandra, HBase, Druid DB , Elasticsearch, etc. A decentralized storage network will likely take this form, where there are a large number of storage nodes storing unique set of data segments. A decentralized storage network needs to have a mechanism to coordinate incentives and storage proofs. Storage providers have to be incentivized to participate in the network. Storage clients or the network have to be able ask for storage proofs from storage providers. This boils down to the need to coordinate storage proofs and payments. A blockchain, either native or external, is a natural choice to coordinate these interactions if decentralization is required. In the next sections, I will review some popular DSNs. I am not aiming to have a full description of these networks. My goal is to provide a succinct overview of the key features so I could discuss what I like and don't like about the different networks. IPFS IPFS is often mentioned as the leading solution for decentralized storage. IPFS could be used to storing and accessing data. However, it does not have an incentive mechanism. There is no guarantee that the IPFS nodes would continue to keep the data or provide access. IPFS is a network of nodes that are joined together to form a distributed hash table ( DHT ). IPFS has additional structure such as InterPlanetary Linked Data ( IPLD ) and Merkle direct acyclic graphs ( DAG ) that enable a variety of use cases to be built on top of its network. IPFS alone cannot act as a data storage solution. Filecoin The Filecoin network uses a native blockchain to enforce storage deals. The blockchain is similar to the Bitcoin or Ethereum chain. Each block is composed of messages. Message is analogous to transactions. A blocks is produced by a miner selected by the storage power consensus mechanism. Storage power consensus is similar to proof of stake, but instead of using stake to gain a higher probability of becoming a block producer, the mechanism favors storage providers with more proven storage capacity and storing more validated data. The contents of messages record storage deals, which are negotiated between storage clients and providers. The negotiation could be done in any communication channels. The completed deal is published on-chain. Storage miners are incentivized to submit storage proofs because doing so increases their probability of earning block rewards. The submitted proofs are included in the blocks. Miners encode the data in a process known as sealing. The slow encoding ensures that the miners could not re-compute the encoding on demand when an audit is requested. A SNARK on the encoded data is used as the replication proof to reduce the proof size. This is known as proof of replication . The miner is expected to submit additional proofs for the possession of this encoded data on a regular interval. It is incentive compatible for the miners to keep the encoded data on local drives because otherwise it would not be able to regenerate the data or fetch the data from remote storages. This is known as proof of spacetime . It is worth pointing out that the Filecoin blockchain is essentially a single shard blockchain. The stored data is not on-chain, but the administrative messages are on-chain. A blockchain has limited throughput. See blockchain throughput for more details. The chain snapshot as of 2021/09/05 was already at 600 GB . It is not hard to see that the chain's daily throughput will reach saturation quickly if usage increases. Regardless of how much storage the total miners pool could support, the bottleneck would be on how much throughput the blockchain could support storage deals. This is not all that different from the total transaction throughput of a general purpose blockchain such as Ethereum. I will discuss this more toward the end of the post. Arweave Arweave is designed to offer permanent storage. One of key features of Arweave is that it does not chain the blocks together as a singly linked list. Instead, it introduces the concept of recall blocks to form what a blockweave . Each block contains a reference to a random previous block. Having access to those recall blocks is what the mining nodes are competing on. The mining nodes could only produce a new block if they store the specific recall blocks, depending on the randomness as denoted by the new block's hash. The more recall blocks the mining nodes store, the more probable they could produce new blocks. The network also provides access to metadata in the from of block hash list and wallet list. These summarizing data allow transactions to be verified without reprocessing historical blocks. The blockweave acts as both the incentive layer and the data storage layer. The blocks in the weave records information about the account state and rewards, and also crucially, the blocks also store arbitrary data. The permanence of the blockweave equates to data permanence. Arweave's alternative approach to block linkage offers some unique advantages. First, it allows for heterogenous storage nodes, which is a requirement for a storage network. This allows the network to scale linearly because nodes are not required to replicate the exact same data. Second, it combines the storage and incentive layer into a single data structure. Third, the use of block hash list and wallet list allow a new node to join the network quickly. This eliminates the need for a light client implementation. Arweave incentives nodes to provide access to historical blocks through a reputation system known as Wildfire. It ranks nodes based on how frequently and how quickly they respond to data requests. Nodes that do not regularly participate could get blacklisted. Blockweave's claim of permanent storage might not be guaranteed if the size of the blockweave gets large and sufficiently decentralized. There is no formal guarantee that all blocks will be available. Nodes are incentivized to store as many blocks as possible, but equally, they want to make their blocks as rare as possible. This could lead into suboptimal social behavior. First, it is possible that some rare blocks got lost if there are not sufficient replicas in the network. This could be mitigated by adjusting the token economics such that we can probabilistically guarantee that there are certain number of copies under optimal conditions. Even with that, if the replica number is in the single digit, it is not hard to see some data segment get lost from time to time. Second, miners are incentivized to hold on to rare private blocks so that they would become the only one who could mine some blocks. While this is mitigated by the reputation system, but some miners with a short horizon would risk their reputation to maximum short term profit and exit the market. The centralization of these blocks would increase the probability that these blocks could be lost permanently. The availability of the metadata could be a weak point in the system. The network makes an implicit assumption that block hash list and wallet list are valid and accessible. These metadata have to be keep up-to-date by all the nodes in the network. These metadata acts similarly to the blockchain as global state. This leads to two problems. First is that if the wallet list increases, say 10 billions, the metadata would be hard to propagate within the network. It would become increasingly hard to gossip the updated state across the network. It could take days to download a copy of that data for nodes that just joined the network. Second, it is not fully specified how the network could guarantee the integrity of these metadata. In theory, The metadata could be recalculated from all the previous blocks. But as the weave's total size increase to exabyte level, it would be prohibitively expensive to recompute all the blocks by any one party. An attacker could concentrate its resources on gossiping malicious blockhashes to corrupt the weave, or gossipping wallet lists for monetary gains directly. Sia Sia uses a blockchain to manage file storage contracts, storage proofs, and payments. The contract negotiation is off-chain, but the completed contracts are recorded as on-chain transactions. Transactions are similar to Bitcoin transactions but with some extensions, allowing contract creation, storage proof, and contract update messages. Similar to how Filecoin uses the blockchain as the coordinating device to maintain a contract's storage proofs and payment, the blockchain's throughput is limited. Once a storage contract is published, the network requests the storage proofs on regular interval. Storage proof is a randomly requested segment of the original file and the appropriate Merkle branch. Because the proof requires the provider submitting an actual data segment that walks up the Merkle tree, the data segment has to be at least as large as the smallest Merkle leaf. This leads to a tradeoff. If the leaves of the Merkle tree are small data segments, the tree is large; otherwise, the data segments are large. Either way, it is easy to see that for a 10 GB storage, the proof could easily take more than 10KB . The chain has to put these bytes into the blocks, further reducing the space available for contracts. Payments are through payment channels and therefore, primarily off-chain. Payment channel eliminates a lot of on-chain transactions, but it also creates a dependencies that storage clients have to be connected to the network. It discourages individual users from directly storing the data on the network. Instead, users would most likely go through a service provider because users do not want to perform the work of regular audit and payments. This flaw is not critical because even an one-person application development shop could maintain a long running service to keep payment channels up-to-date. The frequency of interacting with the network is small, the same frequency as the data possession proof is accepted on chain. As long as the software is sufficiently open sourced, and it is easy for third-party vendors to setup to provide this service according to an open protocol, it is an acceptable requirement for a DSN . Safe Network The network maintains local state by segmenting the network into group by XOR closeness. Instead of using a blockchain to maintain state, the network segments the network into local group by XOR distances, the same metric used in Kademlia. The group manages its own state. The key managing nodes within each group would assign and re-assign data based on the group's membership. This technical overview describes the network to be built on top of a Kademlia-like DHT . The key feature that the Safe Network innovates on seems to be its ability to guarantee data availability as nodes join and exit the network. But it is not exactly clear how the network could self health. The network explicitly excludes the usage of blockchain to maintain a global state. This begs the question of how the network provides incentives to the network participants. As best I could understand from the available documentations, Safe Network uses a MaidSafeCoin that exist still as a coin on Bitcoin's omnilayer. I could not understand how the coin would be part of the incentive mechanism of the storage network other than the coin is used as direct payment to node or group of nodes. This payment channel would happen outside of the Safe Network. The Safe Network's proposal looks a lot like an improved version BitTorrent or IPFS . While it offers some level of data persistence guarantee that IPFS does not, the Safe Network does not fully incentive storage provider to participate. The idea of not requiring a global state is appealing, but the use of local state must extends beyond data management and incorporate incentives. A BitTorrent-like storage network could only autonomous and self-healing if each participating nodes are paid to stay online. Data storage requires physical resources, and someone has to pay for those physical resources. Storj Storj does not use a blockchain to maintain state. Published storage contracts are maintained by the private databases in satellites, which are centrally operated services. A satellite manages accounts, API credentials, billings, payments, audits, repairs, and various administrative tasks. Both storage clients and providers make API requests to a satellite to publish and maintain storage contracts. Accounts from different satellites do not interact with each other. Clients and providers could negotiate storage contracts through any communication channels. The negotiations are not recorded, but they publish the contracts to a satellite. The satellite accepts the contract and subsequently service the contract by requesting storage proofs and handling payments. The storage proof is in the form of a challenge mechanism. The data owner built a Merkle hash with secret salts at the leaves. Storage provider has to calculate a Merkle branch based when a secret salt is reviewed to them. The provider could only successfully answer the challenge if it has the data to combine with salt to produce the correct hash that corresponds to the Merkle branch. The client library has built-in toolings to encrypt data and segment the data by way of erasure encoding. Both encryption and erasure encoding are enforced by the network. Storj only accepts Storj coin as a payment mechanism. It is a coin on the Ethereum network. Ethereum is only used to facilitate payment. The EVM is not used to store state or coordinate contracts. Storj uses the concept of satellite to maintain state instead of the approach of a public ledger. A satellite has full control over the accounts that it manages. If a satellite experiences a technical failure, the associated accounts would cease to function. In most cases, the storage clients would not be able to recover the data because they would not be able to find the corresponding storage providers. A satellite is a trusted entity. For example, this is a list of trusted satellites in operations today. While storage nodes are decentralized, the network relies on centrally operated satellites to coordinate the clients and providers. Other than centralization risk, the scalability of the network also hinges on the satellites' abilities to scale their services. The Storj network's overall capacity largely depends on the satellites' capacity to handle accounts and storage contracts. Discussions The promise of decentralization could easily be eroded in storage networks. While some storage solutions use decentralized networks as storage backends, their services depend on centrally operative services. The first example is that Storj's use of satellites to manage accounts, storage proofs, and payments. The second example is IPFS and pinning services, e.g. Pinata. The service provider stores data on IPFS . It guarantees data persistence by running IPFS nodes and pinning user data on those private servers. The availability of users' data is only guaranteed if Pinata could successfully maintain uptime of their IPFS nodes. The third example is Filecoin and easy-to-use storage gateways, e.g. Textile Hub and Web3.storage . The stored data is ultimately routed to the Filecoin network, but these services manage accounts, packaging the data, and negotiate with storage miners. The fourth example is Sia Skynet. It is a portal built on top of the Sia network. Skynet give users and application developers a superior user experience, but it also means that data go through an intermediary before landing on the decentralized network. The promise of permanent storage might not be desirable. Arweave and Safe Network proclaim that data will be kept forever on their networks. Despite the many claims that storage has been getting cheaper, storage is expensive. An averaged American household cannot afford storing petabytes of data. An averaged company cannot afford petabytes without at least extracting tens of thousands of dollars annually from those data. One might point to Bitcoin and Ethereum network as example of permanent storage. The amount of storage is extremely limited. 1GB of storage in those network will cost tens of million of dollars or 100s of millions depending on market conditions. A network could keep its permanence guarantee if the value of the token economics always outpaces the actual cost of physical resources. That is not a stable equilibrium, nor is it desirable. With a sufficient large stable storage network, the marginal benefit a node gains should be the marginal value that it provides in storage. The overall market design should account for the costs of continuing to pay the storage providers to keep the exabytes of data around month after month, and year after year. When large amount of data becomes useless and no one is willing to pay for it, the world should not be wasting resources to keep them around. Storage contracts require a consensus mechanism. A blockchain is the most obvious solution. However, none of these networks has thought through how to scale the blockchain part of the network. It is fortunate that blockchain scalability has been gotten a lot of attention due to the rapid development of general purpose blockchains, such as Ethereum, Polkadot, Near, and Solana. There have been major advances in recent years. The most mature solution is to shard a blockchain. One of the key limits of sharded blockchains is the amount of cross-shard communications allowed. Storage contracts do not need to communicate across shards. This would make a sharded solution especially well suited as the coordinating layer of a DSN . Furthermore, the security of the additional shards do not require recruiting additional validators if we adopt storage capacity as the securing instrument. Storage network requires a mechanism for validating storage proofs to maintain data availability. Storage providers submit proof on a regular interval. However, the requests could come from storage clients or the network. For example, Sia uses off-chain storage proofs and payment channels to maintain record of data possession. There are obvious advantages of not requiring storage clients to perform any ongoing maintenance work. The network could require the clients to lock up sufficient fund to pay for the cost of the storage duration, and pay the providers proportionate to the number of storage proofs provided. There are some DSN features that could be implemented by client libraries. For example, Sia and Storj enforce data replication and encryption. Filecoin enforce replication. I tend to believe that the storage network should leave those features as client options. A decentralized network is already complex. Storage clients could choose how their data would be encrypted, stored using erasure coding technique, or replicated. There are applications that want to store unencrypted data. There are applications that do not want to pay the extra costs of replication. Furthermore, these features could be built as client libraries to be compatible on multiple networks. Keeping these feature out of the core network simplifies the design of a storage network.","tags":"misc, , web3","url":"/2021-09-06-dsn","loc":"/2021-09-06-dsn"},{"title":"Notes on Blockchain Throughput","text":"Preliminaries In this blog post, I discuss down my observations on the throughputs of some of the major blockchains: Ethereum 2, Cosmo, Polkadot, and Solana. I choose to use byte throughout, byte/s, as a key metric to comparing different blockchains instead of transactions, tx/s. Most of of scalability comparisons focus on transaction throughput. See ex1 , ex2 , or ex3 . Transactions per second could be a misleading metric because different data models and layer 2 solutions could lead to dramatically different byte size per transaction. For example, Ethereum transactions averaged to about 500 bytes. If a blockchain only uses a 8 byte address space and take advantage of aggregate signatures to limit signature size to 2 byte, a transaction would be 10 bytes. For another example, state channels could also be used to put data off-chain to achieve scalability. Byte is a more generic measure to compare blockchain throughputs. Data on a public ledger could be used to encode vastly different messages, ranging from simple token transfers to arbitrary computations. Transaction data in Bitcoin is primarily used to do accounting for unspent transaction outputs. A general purpose blockchain such as Ethereum could use data to encode computations via op codes and contract states. Measuring a blockchain's throughput in bytes allows us to ignore the blockchain's computing paradigm, allowing us to focus on how its consensus mechanism and p2p network architecture limit what a blockchain could do. Bitcoin's block size is 1MB and there about 150 blocks per day. Bitcoin's data throughput is about 1.5 KB /s . Ethereum 3.5 KB /s . The two blockchain's data rates are remarkably similar. See Ethereum Average Block Size and Ethereum Blocks Per Day . Different data rates support different applications. For example, much of the decentralized finance applications are similar in that they record infrequent buying and selling activities of some blockchain assets. Assume that I make 100 transactions a year, with each transaction being about 100 bytes. I only need to 10KB of data. If there are 10 million people using the platform, 100 GB /year is required. That is about 4 KB /s . Ethereum could support this type of applications as of today. In fact, defi applications are thriving in Ethereum today. An application that uses more data would be a widely payment system. Visa is the probably the most frequently cited example showing the inadequate throughput of Bitcoin and Ethereum. Assuming Visa payment system processes 10,000 tx/s, it would be about 1 MB /s . This far exceeds the current 1-4 KB /s range in Ethereum. But it is insurmountable. We are likely to get 10-50x improvement in Eth2. If we get another performance boost through layer 2 technologies, it is not inconceivable that we will get to the payment system throughput requirement. I will discuss this later in the post. Gaming, messaging, social networking, or online advertising system have a much higher data throughput requirement. I would want to envision a decentralized internet where my data is not owned by a centralized private company. Let us assume that an average American uses 100 KB per day in emails, textings, and social media platforms. Assume 300 million people. The platform would have to support 300 MB /s . That sounds a lot, but there are a lot of enterprise applications that easily process data volume much greater than that. It will be hard for any blockchain to support this type of data throughput. I will look into details later in this blog post to explain why. Applications in this category will likely use off-chain data processing and storage techniques. It is a useful reminder that decentralization does not all data and computation have to be pushed on-chain. Eth2 Eth2 has by far the most detailed designs and open discussions of the all the major blockchains. The proposed specs shed the most light on what blockchain platforms will look like in the near future. Eth2 is a sharded blockchain. The initial target is 64 shards. The other important proposed targets are block size of 128 KB and a block time of 12 seconds. See mainnet spec , beacon spec , and note on shard chain proposal . The two key metrics that determine data throughput is the data capacity of one shard and the upper limit of the number of shards that the Eth2 could support. Shard Capacity The targeted throughput of one shard is roughly 10 KB /s . This target is already well calibrated for various theoretical and practical reasons. Vitalik has a good post explaining how the practical details of internet bandwidth, computing power, and storage dictate the limit the data throughput of a single shard. I do not fully agree with Vitalik's stated limit on bandwidth because requiring a gigabit bandwidth does not increase centralization risk. There are sufficient people who have residential gigabit or are able to leverage cloud resources. The accumulated size the chain state is a hard limit. It is not inconceivable that network participants are required to have large storage devices. The users could have 10 TB of SSD storage to store active data. For simplicity of argument, assume that the most recent blocks from the last 3 years are active data, and older blockers could be put into deep, networked storage. This would lead to a throughput of 100 KB /s. With such a large state db, where the hot data spans 10 TB , the best we could do to provide fast access to the state db is through memory mapped indices. Even assuming that the nodes are required to have 64 GB of main memory, majority of reads and writes would still require the system to swap pages. Even if all other validation steps are instantaneous, reading state itself slows down the event processing. There is sufficient evidences in the last decade of enterprise experiences from data queries and stream processing frameworks that a database processing node with 10 TB of hot data is unlikely to perform well. If the chain accumulates about 1 TB of data over 3 years, the throughput is about 10 KB /s. It should be noted that all of those limits could be increased. Instead of thinking the participating validator as a single machine, we could design that to be a processing cluster. The validator runs a cluster of workers to have both distributed storage and stream processing. However, this would dramatically increase centralization risk. The participants are likely to be enterprise users or highly motivated group of individuals. It would be easier for them to find each other and collude to compromise the integrity of the network. It cannot be overstated the importance of decentralization. If data is stored and processed only by a few large well-known participants, the overall ecosystem will emerge to look similar to today's internet. Decentralization requires the barriers of entry on participating in core network to be as small as possible. This requirement leads and some heuristics about shard throughput. For more details, see a post on scalability dilemma It is also important to point out that the limit on per shard throughput is not due to specifics of consensus mechanisms. A consensus mechanics limits the latency (block time) of when data is processed, but it does not limit the targeted block size. For a fixed block time, block size is limited by the data throughput of the validating nodes. One misunderstanding about overcoming the per shard capacity is that one could further shard the shards. That is a misguided solution. A transaction processing node in the network has an upper limit in its ability to hold a portion of the chain's state and process the assigned transactions. The appropriate partitioning of the network lead to a sharding design. The network could add more shards or split the shards further, but the network does not uniformly benefit further from a two level sharding design unless there are specific justified transaction network topologies. For example, a layer 1 shard could correspond to countries, and a layer 2 shard could correspond to cities within countries. This type of network design is not likely to be general purpose. I will talk about transaction network topology in the next section. Sharding Limit and Shard Network The primary shard is called the Beacon Chain, which supports the consensus mechanism. There is fixed amount of work that the Beacon Chain has to perform to keep tab of validators, proposer committees, block headers, etc. If shards do not talk to each other, the per shard data on the Beacon Chain is minimum. The Beacon Chain should be able to support thousands of shards if the only data being put on Beacon is block headers and other administrative accountings. However, cross-shard communication is a requirement for Eth2. The size of Beacon Chain scales linearly with the number of transactions that have cross-shard communications. Cross-shard linkage is still a work in progress. See roadmap , sharding research note , and research discussion . One implementation of cross-shard transactions could be aggregating log receipts in Beacon. Beacon would grow with the number of cross-shard transactions. If the fraction of cross-shard transactions is relatively stable, there would be a natural limit to the number of shards that Beacon could support. If the majority of the transactions are cross-shard, effectively Beacon has to make available all of those data. Jamming all the sharded data into one shard would quickly overwhelm the Beacon shard. The number of shards that Beacon could support is inversely proportionate to the fraction of transactions that are cross-shard. For example, if 1% of the transactions are cross-shard, we could expect Beacon to support 100 shards. Combining the heuristics of one shard throughput and shard number limit, it would suggest that a reasonable byte throughput upper bound is 1 MB /s if the cross-shard rate is 1%. If the cross-shard rate is 0.1%, and we would get 10 MB /s . Referencing our discussions earlier, it would mean that a global payment system could be implemented on chain. But we won't see micropayment system or social networks implemented directly on Eth2. Those applications will require layer 2 scaling solutions. The bottleneck of cross-shard communications puts an upper bound on the connectedness of the sharded topology. Shard to shard communications could not exceed the limit of a per shard capacity. For example, a global payment system is implemented on chain. Each zipcode's internal payments are placed in one shard. If the payment going inside and outside of that zipcode far exceeds the remaining capacity of the shard, the payment system gets overloaded. An application designer has to identify the eventual network architecture and understands if the bottleneck could be eliminated by grouping those users to a shard. It is possible that transaction topology is sufficiently connected that there is no sharding solution available given the per shard throughput. For those who are interested, I wrote a paper a long time ago on network bottleneck that had more details on how to identify and calculate network hotspots. Cosmo Cosmos is a decentralized network of independent parallel blockchains. Each cosmos-backed blockchain is secured by the Tindermint Byzantine Fault Tolerant ( BFT ) consensus. Application developers use the Cosmo SDK to write their own data model and state transition logic. A blockchain could exist completely independent of the rest of the Cosmo network. Each Cosmos blockchain is called a zone, and there are special Cosmos chains called hubs that coordinate cross-chain communications through IBC . Each zone has its own governance to decide how the validations are performed. The hub does not validate any transactions coming from zones. Each zone is essentially is a parallel chain. The throughput of each zone falls under a similar set of scalability requirements of a single Ethereum shard, 10 KB /s . If transactions do not have to be cross-zone, we could create as many parallel zones as possible for different applications. As long as there is no single application that exceeds the capacity of a single zone, there is no limit on throughput. However, an example application such as a global payment system would require many cross-chain communications. Such an application would have to designed with many zones and many hubs. Because cross-chain communications could be designed in any zone-hub connection, a payment application could divide users appropriately into different interconnected zone. The theoretical limit would be whether the overall network could support all the cross-chain links. For an extreme example, if users in the payment network form a complete network and there are frequent cross-links, even if each person takes up all the capacity of a dedicated zone and hub, the number of cross-links would overwhelm the dedicated hub. Most real life networks would be much more sparse. Still, application developers would have to be clever about managing sharding the users to different zones. This is no different than designing a sharded database in centralized web technologies. The key difference is that per zone capacity is small, and each zone's cross-chain throughput could only take up a fraction of that. Polkadot Polkadot uses a master Relay Chain to coordinate consensus. Transactions are finalized when they are ordered by the master chain. The system has a universal set of validators that are assigned to validate each of the parachains. Validators do not store state information, but they store block headers and block proofs. Validators are capable of ensuring that the blocks are valid without the need to build out the chain's history or current state. Each parachain has a set of Collators. Collators maintain the full data that is used to construct the state transitions for their assigned parachain. Collators generate the blocks and provide validity proofs of the generated blocks. The per node storage and computing throughput of Collators limits the throughput of a parachain. Each collator has to store the full parachain. This means that each parachain faces the same shard limit as a single Eth2 shard assuming Collators use similar hardwares as Eth2 validators. Validators collectively ensure block proofs are available through erasure coding. Assuming block proof is 1000th of the block size, then it is unlikely that storing block proofs will be a bottleneck because each validator would only need to store 1000th many bytes to a Collator, multiplying the shard number, which is 100. Hence a parachain would process about 10 KB /s. In Polkadot's proposal, it suggests that Collators are expected to run with enterprise level hardwares and network connections. We could expect the actual throughput to be multiple of that, gaining that at the expense of higher centralization risk. Polkadot enables direct cross-chain interaction through cross-chain message passing (xcmp). The content of cross chain messages do not get to Relay Chain, but the metadata of the messages are included in the Relay Chain. Cross-chain messaging implies two limits. First, the cross-chain messages cannot exceed the capacity of a single parachain. Second, the combined metadata of all cross-chain messages on all parachains cannot exceed the capacity of Relay Chain. This is similar to Eth2 Beacon Chain. The exception is that Beacon Chain might include the actual cross-shard messages, and Polkadot Relay Chain only include metadata. The fundamental limit is similar: the combined cross-shard traffic capacity is the throughput of one single shard. For this reason and probably also due to other nuances, the number of parachain is limited to 100. I would expect the Polkadot throughput to be around 1-20 MB /s . Solana Solana is not proposed to be a sharded blockchain, at least not as of the writing of this post. Solana aims to push the blockchain's throughput for a single machine. It has a high hardware requirements for validator. It recommends validator to have 16 cores, 256 GB of memory, TBs of SSD disk storage, and gigabit network. Solana aims to obtain a transaction throughput that matches a network's gigabit speed, 125 MB /s. It combines a number of techniques to remove network, consensus, and storage bottlenecks. The network has a concept of slot and epoch. An slot is about 400 ms, and an epoch is 432,000 slots, which is roughly 2-3 days. A leader is deterministically chosen every epoch. Its transaction forwarding protocol Gulf Stream ensures that the network could quickly route the transactions to the leader. The leader uses proof of history to order all the transactions. Tower BFT allows validators to quickly vote on the blocks. Fork choice rule is a result of proof of history and tower BFT . Furthermore, its turbine gossip protocol allows the out-going messages from the leader to reach validators without breaking the 125 MB /s network bottleneck. Furthermore, Solana proposes an archiving system to allow validator nodes to shed historical data to a storage network. These innovations are clever. It is believable that they could achieve the stated throughput at the network, consensus, and storage layer. That is, this throughput is possible if the state machine's operations from processing each tx are trivial operations. Processing transactions requires reading from and writing to the state. At 125 MB /s, it is about 4 PB per year. In theory, all of this data could be stateful updates. For example, if each transaction is a simple transaction writing key-value pair where the key is random and the value is strictly increasing of all existing values. The state size would grow quickly and easily exceed the limit of a single machine. Solana should not be able to handle the large state without distributing the reading and writing of the state, which are in the petabyte scale. The 125 MB /s throughput is only possible if almost all of the transaction data does not lead to stateful updates. This data processing problem is not unique to blockchain. Many enterprise companies have built data streaming processing framework to tackle similar problems. A processing system has to be able to split the work and coordinate stateful updates. As mentioned before, Eth2's data throughput is not limited by the shortcoming of its consensus mechanism. The 15 seconds block time in Eth1 is due to proof of work. The difficulty is adjusted so that blocks are created roughly once over that period. There is empirical research to show that 12 seconds delay is needed for the network to propagate and verify the transactions in a large decentralized system such as Bitcoin and Etheruem. The gossiping mechanism for propagating proposed blocks is slower than Solana's approach. That slows down blocktime but it is not a bottleneck for the overall data throughput. There are a number of heuristics that could reduce the targeted block time: smaller blocks, separate block headers and block contents (see this post ), etc. The true limit of data throughput is state size and transaction history, which is a result of storage requirement. Solana proposes key innovations to reduce bottlenecks around network, consensus, and storage. I do not believe that Solana could handle the theoretical 125MB /s throughput for arbitrary transaction data. Let's assume that a Solana validator could safely support a state size of 3 TB and that is accumulated in 3 years, and on averaged, 10% of the transaction data become the state. Solana would be able to support about 1 MB /s. That is still really impressive throughput. Final Words The throughput of a single shard blockchain is limited. This limit could be increased if each validator effectively runs a distributed computing and a storage cluster, making one shard a network of clusters. Assuming that per shard throughput is fixed, a network of shards increases throughput linearly. A sharded blockchain could be bottlenecked by cross-shard communications. If an application could tolerate a network topology where there are small complete networks encapsulated within shards and there are only sparse cross-shard linkage, the application could be supported on chain. A sharded blockchain could support throughput up to 1-20 MB /s while allowing reasonable cross-shard linkages. This would be sufficient to support a global payment system. Data heavy applications such as gaming, messaging, and social media will likely have to leverage layer 2 techniques.","tags":"misc, , web3","url":"/2021-07-23-blockchain_throughput","loc":"/2021-07-23-blockchain_throughput"},{"title":"Quote of the Month","text":"One can search for facts, but there is no shortcut to knowledge.","tags":"misc, , quote","url":"/2021-01-31-quote-of-the-month","loc":"/2021-01-31-quote-of-the-month"},{"title":"Standup Paddleboarding in the Bay Area","text":"The standup paddleboard ( SUP ) is a versatile watercraft. It is guaranteed fun wherever there is a big body of water. It is a great tool to explore coastlines, estuaries, rivers, or lakes. It is much easier to transport than kayaks and canoes. It could be launched with minimal water accessibility. When I want more action, I could take it out to face ocean waves! Equipments I distinguish surfing standup paddleboards from flat water paddleboards. I don't know much about flat water paddleboards. I had used inflatable, foam, or high end custom made surfing paddleboards on flat water. The all worked great. I had not tried to race or needed to haul overnight camping equipments on paddleboards. As long as the paddleboards have sufficient volume, it should work well enough to paddle around on lakes or up and down wide rivers. The aspects of a surfing standup paddleboard is hard to summarize. When I was starting, I only needed the board to be large enough to be stable. That was similar to how I learned to surf. A SUP with high volume provides sufficient stability to maneuver to right place at the right time to catch waves. Beyond the beginning stage, there are boards shaped specifically to a paddler's size, skills, surfing breaks, wave sizes, and style of standup surfing. You should not be reading this blog for tips on how to spend thousands of dollars to perfect your SUP quiver. I started with the trusty, world-famous Wavestorm foam SUP from Costco. It was a great tool to let me learn to catch waves up to 3-4 ft at Ocean Beach. But it is hard to surf on it. The board is too wide to put a rail into the water unless I stand two ft off center line, but then it would be impossible to go to the other rail. It is massive and heavy. Still, I love that board! Anyone could stand up on it even without any prior experiences. It is practically a boat, but without a hull. I could use it as a kayak substitute in almost all cases. I don't worry about dinging it because it is foam. It is cheap and has been sold as an amusing 2-pack, as if they are Target underwear. It is the best first SUP everyone should own. Find someone to split that 2-pack with you. SUP Surfing I got into supping because I wanted an alternative to longboarding. It allows me to cover a lot of ground and fight the lateral currents in my favorite beach break. I catch more waves. I can steer clear of any cluster of surfers to find my personal sand bars. I see more and stay warmer because I am out of the water. I surf my SUP almost in all spots where others surf, and also surf in many places where people don't want to surf. I don't take my SUPs out on bigger days or on crowded point breaks. My main motivation to standup-surf is to have a relaxed time with friends out in the ocean. My favorite standup surfing spots around the Bay area Ocean Beach, Rockaway beach, and Bolinas. Whenever I go on a camping trip along the coast or anywhere close to a big body of water, I almost always pack my standup paddleboards. It is always a surreal moment to catch the sunset on the water, regardless if there are breaking waves or not. SUP Exploring Once I had a couple of SUPs in my basement, I discovered these watercrafts offered far more than just another style of surfing ocean waves. I could take my non-surfing friends out to the water, and they would always have a great time. I could use it as a sea kayak to transport crab traps. I could launch them exploring any body of water I see on a map. A SUP water experience is quite different than kayaking. While kayaking, I usually don't want to flip the boat. While supping, I usually dress to get wet. I could easily jump into the water to cool off. I see more because I am standing up, but at the same time, I feel closer to the water because there are no barriers. A SUP is much easier to be transported than kayaks. I could walk my paddle and board easily from my car to launch points. It is easier to load onto my car. I could transport multiple boards for myself and my friends all in one car regardless whether they inflatable or otherwise. A solid roof rack for a car is must. Here is a selected list of launching spots and trips that I have gone near the San Francisco Bay Area. Picnic Area Crissy Field . Have a picnic with friends here. Launch from the beach. One can paddle toward Fort Mason or Fort Point. Watch out for currents, tide changes, and potentially tricky surf under the Golden Gate Bridge. Bolinas. Great beginning surfing spot. Great crabbing spot. Great spot to just paddle and explore the coastline and the lagoon. There are schools of leopard sharks in the lagoon; I have seen them. There are great whites out there in the ocean; I have not seen those. Drakes Estero. I launched from this dirt parking lot once. It was low tide, so it wasn't pretty. The two of us had to wade through some serious mud to get out. But once we were out, it was quiet and pure nature. Birds, harbor seals, and wild life everywhere. This water could be closed part of the year to protect the wild life. Don't get those hefty tickets. There might be better launch spots than the one I link here. Port of Oakland . Don't laugh. One wouldn't think that there are launch spots in Port of Oakland. It is a unique place to checkout some giant cranes and go under the Bay Bridge. It is not for everyone. Alameda crab cove or Encinal boat ramp . I was a student at Encinal High School. I always wanted to go explore that coastline on a boat. I didn't have the information or the money about how to do that when I was young. The water off this coast is not paddled often. You could usually find some quiet time paddling around there. You could paddle to see some retired battleships at Alameda Point. Clipper Cove at Treasure Island. There are not a whole lot to check out here. It is another calm place to get a full exercise on the water. Bradford Island Ferry . There is a large birding paradise just off the main waterways. There are some intermittent levees dividing the water highways and the lake-like body of water. It would be a hard place to launch kayaks off the jetty. One could easily carry down a paddleboard to the rocky jetty and just paddle away. Putah Creek near Winters. The river has many faces. The river landscape and fauna changes often. It has wide and deep sections. It has rapids. It has shallow parts where it could be challenging to kayak but easy to walk our paddleboards across the shallows. The water is clear and calm. There are many kind of water fowls and birds hanging out. If you go there, most likely you will be only people enjoying the water. Spring is the best time. More Launched Spots I will attempt these in the near future. Tomales Bay Antioch to Sherman Island Brannan Island State Recreation Area Alviso Marina County Park","tags":"misc, , water, surfing","url":"/2021-01-18-paddling-bay-area","loc":"/2021-01-18-paddling-bay-area"},{"title":"Elliptical Curve Encryption and Signature","text":"This post gives a hands-on tutorial on the basic concepts of elliptical curves needed to implement public key encryption and signature from first principles. I pick a specific curve, secp256k1, to keep things simple and concrete. This post is created to show that basic operations are really simple once the preliminaries are out of the way. For example: A private key is nothing but a 256 bit integer. A key exchange is just combining a private key and a public key: \\(S = \\rho_a H_b\\) A digital signature is as simple as calculating one number and sending it along with message: \\(s = \\frac{1}{k} (h + P_x \\rho)\\) A First Look at Curve Secp256k1 An elliptical curve is defined by the equation \\begin{equation} y^2 = x^3 + ax + b, \\text{ where } a=0 \\text{ and } b=7. \\end{equation} Furthermore, the following constants will be used to define the abelian group that is key to all cryptographic computations. p = 0xffffffff ffffffff ffffffff ffffffff ffffffff ffffffff fffffffe fffffc2f G_x = 0x79be667e f9dcbbac 55a06295 ce870b07 029bfcdb 2dce28d9 59f2815b 16f81798 G_y = 0x483ada77 26a3c465 5da4fbfc 0e1108a8 fd17b448 a6855419 9c47d08f fb10d4b8 n = 0xffffffff ffffffff ffffffff fffffffe baaedce6 af48a03b bfd25e8c d0364141 h = 1 \\(p\\) defines field of integers modulo \\(p\\) . \\(G\\) is the base point of the subgroup that I will describe shortly. \\(n\\) is the subgroup order and \\(h\\) the cofactor. See this page for more details. Because I am working with the special case of Secp256k1 curve in this post, I ignore the discussions on many potential vulnerabilities due to singularities, non-prime order, etc. Not all elliptical curves are secure choices. Group Construction from the Elliptical Curve We want to use the curve equation and constants to define a cyclic abelian group that we can used for cryptography computations. I don't think it is necessary, but if you want to, here is a good reference to refresh your group theory basics. Group elements Every (x, y) pair is an element in the group if the point is on the curve, and x and y are integer modulo p. That is, \\begin{align*} B := \\{ (x, y) \\: | \\: & y^2 = x^3 + 7 \\; (mod \\; p) \\\\ & x,y \\in \\mathbb{F}_p \\} \\end{align*} where \\(\\mathbb{F}_p\\) is the set of integers modulo \\(p\\) . Note that I use \\(B\\) to denote this parent group. The identity element is the point at infinity, denoted by \\(0\\) , such that for every \\(x \\in B, 0x = 0\\) . The inverse of \\(P=(P_x, P_y)\\) is \\begin{equation} -P :=(P_x, -P_y). \\end{equation} Geometrically, \\(-P\\) is the mirror point across the x-axis. Group operation I will use \\(+\\) to denote the binary operation, which we will call addition. The rule of addition is \\(P + Q + R = 0\\) if all three points are aligned. That is, if we draw a line on the graph, the three points that it touches add up to zero. Note that an inverse element, i.e. \\(-R\\) , is already defined. We can turn this geometric rule into algebraic form. Let \\(T = P + Q\\) . Let \\(m\\) be the slope of the line formed by \\(P\\) and \\(Q\\) . We have \\begin{equation} m = \\frac{P_y - Q_y}{P_x - Q_x}. \\end{equation} Note that the slope could be computed by taking the limit if P and Q are the same; in other words, we get the slope by computing \\(dy/dx\\) from the curve equation, \\begin{equation} m = \\frac{3P_x^2}{2P_y}. \\end{equation} The point \\(T = (T_x, T_y) = (m^2 - P_x - Q_x, m(P_x - T_x) - P_y)\\) . These calculations are much easier to visualize if you draw a straight line on an curve. You can use Wolfram Alpha if you are lazy. The rule of multiplication rule is a notation convenience, where \\(nP := P + ... + P\\) with \\(n\\) -many repetitions. The straightforward way of multiplying through pointwise addition is not computationally efficient. There are alternatives such as double-and-add and windowed method. Subgroup I did not show why the set \\(B\\) and the binary operation addition forms an abelian group. One could check the group laws easily. I am not going to show how the following base point is chosen, but note that the base point is chosen so that the cofactor is 1 . The base point \\(G = (G_x, G_y)\\) is G_x = 0x79be667e f9dcbbac 55a06295 ce870b07 029bfcdb 2dce28d9 59f2815b 16f81798 G_y = 0x483ada77 26a3c465 5da4fbfc 0e1108a8 fd17b448 a6855419 9c47d08f fb10d4b8 I abuse notation and use \\(G\\) to denote the cyclic subgroup generated by G. Note that \\(nG = 0\\) and \\((n+1) G = G\\) . The subgroup contains all the point of the parent group. The group order is n = 0xffffffff ffffffff ffffffff fffffffe baaedce6 af48a03b bfd25e8c d0364141 . So far, I have defined the group and the binary operation. We are set to do cryptography computation. This group is special because the for any \\(\\rho \\in \\{ 0, 1, \\cdots, n - 1 \\}\\) and \\(H = \\rho G\\) . The discrete log problem of recovering \\(\\rho\\) when \\(H\\) is known to be computationally hard. Public Key Encryption through Diffie-Hellman ( ECDH ) Keys A private key is random integer, \\(\\rho \\in \\{ 0, 1, \\cdots , n - 1 \\}\\) . It is 32 bytes in length. The associated public key ( \\(H\\) ) is the curve point associated with \\(\\rho\\) . That is, \\(H = \\rho G\\) . A public key could be written in two coordinates, each with 32 bytes. It could also convenient encoded with 32 bytes + 1 bit, or roughly 33 bytes. Note that one coordinate could be derived from another using the curve equation, and the extra bit denotes the quadrant. Key exchange Alice see \\(\\rho_a\\) , \\(H_a\\) , and \\(H_b\\) . The shared secret is \\begin{equation} S = \\rho_a H_b = \\rho_a (\\rho_b G). \\end{equation} Bob see \\(\\rho_b\\) , \\(H_a\\) , and \\(H_b\\) . The shared secret is \\begin{equation} S = \\rho_b H_a = \\rho_b (\\rho_a G). \\end{equation} Alice and Bob use the private secret \\(S\\) to to encrypt and decrypt via an symmetric algorithm such as AES . Note that the private keys could be ephemeral. For example, when elliptical curve is used in TLS , \\(\\rho_a\\) and \\(\\rho_b\\) are generated for each connection. They are discarded as soon as the shared secret \\(S\\) is generated for that session. Digital Signature Algorithm ( ECDSA ) Signing Alice has the private key \\(\\rho\\) , publishes the public key \\(H = \\rho G\\) , and she wants to signs the message \\(m\\) . Let \\(h = hash(m)\\) be the hahs of the message. The hash function does not need to be cryptographically strong. The purpose of the hash is to uniquely represent the message as an integer. The hash function is common knowledge. Alice generates a random \\(k \\in \\{ 0, 1, \\cdots, n - 1 \\}\\) . Let \\((P_x, P_y) := kG\\) . Alice calculates the signature as \\begin{equation} s = \\frac{1}{k} (h + P_x \\rho) \\end{equation} Alice sends the signed message as \\((m, s, P)\\) . Note that \\(P\\) is 33 bytes, and \\(s\\) is 32 bytes. The signature part is about 65 bytes long. Verification Bob receives the signed message \\((m, s, P)\\) . It is worth noting that it must be established a priori as common knowledge that \\(H\\) is Alice's public key. If \\(H\\) is part of the signed message, then anyone could act as a man in the middle and fake any signed message one prefers. Bob calculates \\(Q = \\frac{1}{s} (h G + P_x H)\\) . The message is verified if and only if \\(Q = P\\) . Here is a short proof of the verification requirement. \\begin{eqnarray*} Q &=& \\frac{1}{s} ( h G + P_x H ) \\\\ &=& \\frac{1}{s} ( h G + P_x \\rho G) \\\\ &=& \\frac{1}{s} ( h + P_x \\rho) G \\\\ &=& \\frac{1}{s} (h + P_x \\rho) G \\\\ &=& \\frac{k}{(h + P_x \\rho)} (h + P_x \\rho) G \\;\\;\\;\\;\\;\\;\\;\\;\\text{substituting } s\\\\ &=& kG = P \\end{eqnarray*} Note that we could just send \\(P_x\\) instead of \\(P\\) and only verified \\(P_x=Q_x\\) . Security and Key Length Given the public key \\(H\\) , how hard is it to brute force search for \\(\\rho\\) , where \\(H = \\rho G\\) ? The private key is any number less than \\(n\\) , \\(0 < \\rho < n\\) , where n = FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFE BAAEDCE6 AF48A03B BFD25E8C D0364141 = 2^256 - 2^32 - 2^9 - 2^8 - 2^7 - 2^6 - 2^4 - 1 Ignoring some technicalities that there are some invalid number less than \\(2^{256}\\) , a private key has bit length of 256 bit long. Here is an cost estimation of brute forcing a 256 bit key. It costs much more than any nation's GDP . However, there are more efficient algorithms to solve the discrete log problem than the brute force search. For examples, baby-step giant-step , index calculus , or Pollard's rho . The computation complexity of elliptic curve discrete logarithm problem ( ECDLP ) is an active area of research. See GG16 for a recent literature review. The best record have been based on variants of Pollard's rho algorithm, in the order of \\(\\sqrt{n}\\) . The security of a 256 key EC private key is roughly equivalent to a 128 bit key that does not have any special structure to exploit. This keylength site compiles tables on the security strength of the popular encryption schemes. The Shor's algorithm using quantum qubits could be used to break elliptical curve's discrete log problem. See KLM07 for a fun introduction to quantum computing.","tags":"misc, , crytography","url":"/2020-12-30-elliptical-curve-cryptography","loc":"/2020-12-30-elliptical-curve-cryptography"},{"title":"My Playgound Monorepo","text":"TLDR : Playground Monorepo Template I keep a playground monorepo as my primary workspace for personal work. The repo is my digital equivalent of a basement toolshed. It is the place where I go to express my ideas. Whenever I want to start a prototype or to test things out, I know exactly where to start writing the first line of code. My playground folders have slowly evolved to become a monorepo over the years. The reason to use a monorepo for my personal project is similar to why a team or a company would want to migrate to a single repo. I use different languages to experiment depending on the hobby and the experiments I am working on. I used to cmd+tab , cmd+~ , open, and close way too many editor windows. I want to share codes among different projects. I want to centralize my documentations. I want to have a single editor interface where I feel at home. I used to have many small repos. I would have a hard time remembering which one was which. I spent too much time setting up new repos. Sometimes I spent more time setting up than making progress on them. A playground monorepo has been a huge success for my personal productivity. I use it for more than just codes. I have organized my noncoding readings and writings into the playground as well. It is embarrassing to say that spending time in my playground sometimes gives me the feeling of working on a well organized basement bike shop with all the handy tools hanging on the wall. Here is the template that I extracted from my playground monorepo. My repo is much large than what is shown there, but they shared the same basic setup. A Single Build Tool: Bazel First and foremost, my playground repo is setup to write codes. In the last few years, I have become a fan of using Bazel as a multi-language and multi-platform build tool. There are many great resources out there explaining its virtues and shortcomings. I am not going to enumerate them here. For my purpose of finding a polyglot build tool for experimental codes, the #1 reason is how little I have to tinker the build tool once I set up a particular programming language. It just works, even when I just git clone my playground in a new computer. When I first decided to setup my playgrounds as a monorepo, all the outdated build tools in my old repos made me cringed. I forgot how to build and run the projects I created. Part of the motivation of putting everything in Bazel was my belief that Bazel build files were less likely to get outdated by yet another monorepo build tool. I didn't want to keep updating my monorepo build tool for each language every year. Many build tools change so quickly that it is tiring to try to keep up. Bazel is sufficiently mature and sophisticated that it is not likely to go through that kind of dizzying updates. Support Multiple Languages Before my playground monorepo, I had to setup every time I started something new. The most annoying part about setting up a new repo was the feeling of deja vu. I would get this sense that my world is glitchy. The second most annoying part is that I always had an urge that I need to apply the best practices of the latest build tool of the language that I am working on. That would easily lead me down a rabbit hole. I only wanted to spend as little time setting up as possible. When I have ideas, I want to write. I write in many languages. Sometimes it is beause there is the right tool for the job, but a lot of time, I am playing around and it is more fun to choose a stack that I want to experiment with. Building in Bazel offers similar experiences in defining library and binary targets regardless of the language of choice.For example, here are two target definitions # go go_library ( name = \"go_binary\" , srcs = [ \"main.go\" ], importpath = \"github.com/jinfwhuang/playmono/cmd/helloworld\" , ) # python py_binary ( name = \"py_binary\" , main = \"main.py\" , srcs = [ \"main.py\" ], ) Furthermore, executing these binaries are exactly the same. bazel run :go_binary bazel run :py_binary It removes a lot of mental switching cost when I shift from working in one language to another. Often, the highest cost in context switch is understanding the build tool and debugging environment. The programming languages and the ideas expressed in codes are usually intuitive and self explanatory. For me, a good analogy is with natural languages. I am more or less in autopilot when I switch talking to my parents in Chinese and talking to my wife in English when we are all in the same conversation together. However, trying to cook my normal dishes in my parents' all Chinese kitchen is really awkward. My parents' kitchen doesn't have my usual spices; it is hard to find an appropriate spatula; the pots and pans are all the wrong shapes and sizes. They have issues with my kitchen setup as well. It is not a perfect analogy. Most of the time, Bazel is a superior build tool than each of the language's best alternatives. It is like having a world class kitchen that is designed for both American and Chinese cookings. Monorepo as a Point of Centralization The greatest benefit of a monorepo is centralization. I can easily reuse my codes across projects. I don't have to copy and paste. I don't have to publish to Git, Maven, NPM , or PyPI just so I could consume it in a different project. Maybe a bit overlooked, but equally important is the feature that I could easily review and search all of my codes and notes in the same repo. Because it is not a shared codebase, I put a lot of unedited, completely disorganized notes right next to my playground codes. Beyond coding, my other personal computer time is spent on reading and writing. I have always kept a chronological order of folders to house the readings that are in PDFs. Back to my school days, most of the documents I produced were text and latex files. Nowaday, I more or less write markdowns exclusively. This blog is written in markdown. Even some of my email drafts were written in markdown in my playmono repo before they were copied into Gmail. I am so much more comfortable writing in my playground so that I do almost all of my writing tasks inside my playground IDE . This feature was a basic directory structure that adopted from how I used to organize my documents. $ tree -L 1 research/ research/ ├── 2010 ├── 2011 ├── 2012 ├── 2013 ├── 2014 ├── 2015 ├── 2016 ├── 2017 ├── 2019 ├── 2020 ├── 2021 ├── README.md ├── project1 └── project2 Infrastructure I don't deploy my toy projects often, but at any given point, I probably have at least a few services running in AWS . I stick my infra codes here in the monorepo as well. For example, see this basic directory structure . Bazel rules_docker makes the transition from building binaries to images seamless. When a binary is defined in Bazel, creating an equivalent distroless image is too easy. Furthermore, rule_gitops is used to connect docker images and deployment files. See an example of deployment . While this feature is not used often in my personal playground, the overhead is low enough that I had converted my previous services deployed in EC2 into this deployment structure. I keep a tiny k8s cluster running in AWS . Once that was setup, instead of having to deal with terraform and ansible to deploy toy projects, I would deploy to my personal k8s cluster by one bazel command. For example, # Build and push docker image, create deployment manifest, and deploy to k8s bazel run //deployment/helloworld:helloworld-namespace1.apply Configure Laptops My monorepo also configures my personal macbook laptops. I use ansible to keep track of my dot files, application configuration files, and most of the system applications that could be installed through command lines. I used to keep that in a separate repo. I consolidated the setup into my monorepo. Here is an example structure . $ tree -L 1 . ├── dotfiles ├── hosts ├── main.yml ├── playbooks ├── roles # sync all dot files ansible-playbook -i hosts main.yml --tags dotfiles # full sync ansible-playbook -i hosts main.yml This allows me to version control my dot files and automate setting up my laptops. I might write a detailed post about how I implement this feature.","tags":"misc, , monorepo, bazel","url":"/2020-12-29-playground-monorepo","loc":"/2020-12-29-playground-monorepo"},{"title":"Quote of the Month","text":"The cure for anything is salt water.","tags":"misc, , quote","url":"/2020-12-15-quote-of-the-month","loc":"/2020-12-15-quote-of-the-month"},{"title":"Analytics Database","text":"In this post, I discuss how I would choose an analytics database as of early 2020. Choosing a database technology is a huge decision because it constrains how I am going to build the data loading pipeline, query patterns, and database administration. In this post, I ignore the consideration of database administration. When I am evaluating database technologies, at a minimum I will do my best to understand the architecture query engine and storage engine . To understand the query engine, I review the in-memory data mode and how worker nodes coordinate smaller computing tasks. To understand the storage engine, I review the file format and storage backends. Examples of file formats are ORC , parquet, and custom-built format; examples of storage backends are apache Kudu, apache Bookeeper, object store such as S3, and custom-built storage layers that are collocated with compute workers. I distinguish between two types of requirements. The first type is for batch processing, and the other is to serve user-facing, realtime queries. Batch Queries for batch processing could take anywhere from seconds to hours to days. The query engines are used to generate reports and ad-hoc analysis. Some data workloads could bypass the query engines and access the data directly. The majority of the hyped-up databases fall into this category. Most of these solutions work well enough. Query performances are generally acceptable. Using SQL query is likely to be the easiest way to slice and dice the data. Once I figure out that a particular solution is sufficient for my query patterns and performance requirements, the other questions that I would ask are: What are my choices for storage engines? Can I use other computing frameworks (Beam, Flink, Spark, MapReduce) on my data? How do I load data into the storage engines? How do I deploy and administer the database? Some of the most obvious choices are: Dremio Presto Hive Spark SQL Snowflake Snowflake stands out because it is a fully managed solution. It does not expose its storage backends for other tools to connect to. It usually means that the data platform needs to keep another copy of the raw data in an object store (e.g. S3) before the data is loaded into Snowflake. Realtime Realtime analytics queries need to be completed in seconds, if not less. All of the query engines listed in the previous section are not designed to serve user-facing queries. Even if we tune those engines to have reasonable realtime results, they are usually expensive and deliver horrendous tail latency, which will take years to learn and optimize. Realtime analytics should take advantage of as many of the following features as possible: minimize network: Majority of the data are collocated at compute nodes. minimize disk: Compute nodes caches hot data in memory. memory mapping: The file format should be designed to be compatible with the in-memory data model. Memory mapping avoids data copying from kernel space into user space. The data model uses page markers to allow for partial scans. optimized in-memory layout: The compute engine performs computation without wasting cycles on converting data types. In my experience, even if the most appropriate query engine is chosen, there is going to be a need to preaggregate data and remove unnecessary columns to get acceptable query performances. The options for realtime analytics are limited and flawed. My top choices are: Druid( dru20 ) Pinot( pin20 ) Elasticsearch Both Druid and Pinot are designed to be realtime analytics engines ( Lev18 ). One can read their documentation for details. I have to mention Elasticsearch as a valid option here even though it is designed for full-text document search. Elasticsearch has a simple design that horizontally scales out its search and aggregation to each of its shards. Data collocation, horizontal scaling, aggressive in-memory processing, and flexible aggregation semantics are sufficient to make it a passable realtime analytics engine for basic aggregation and bucketing that are typical of user-facing dashboards. Data Storage I would consider data storage backends after I identify the query engine most appropriate for the query patterns. Sometimes I do not have a choice. For example, Elasticsearch has built-in storage layer. Druid's storage format in the form of Druid segments and metadata is fixed. However, if the engine of choice is Presto, we would have to choose the storage backend and file formats. My personal preference is leverage on an object storage as much as I can. S3 is dead simple. I tend to prefer Parquet if I have a choice. For example, one of my favorites is to organize data as partition datasets on S3 as Parquet files. table_name/partition_key1/partition_key2/partition_key3 These data could be used by Preso, Hive, Dremio, and Spark SQL directly. Besides using SQL , I can easily write Spark or Flink jobs to process them. I can write background batch jobs on the data without any frameworks. This style of storage engine is flexible and supports high throughput on a batch mode. Realtime Data and Realtime Analytics One of the most desirable features of an analytics system is have realtime query performances on realtime data. This is hard, and there should be an impossibility theorem out there to sound the alarms for the uninitialized. The perfect solution does not exist, it is a matter of tradeoffs. Achieving realtime analytics is hard enough. That is, even if we can assume that the data format and storage are fixed and could be optimized for answering queries, fast realtime queries are still difficult. In the best case scenario when we have data collocating, data model compatible with memory-mapping, and optimized in-memory data representation, we might still have to rely on pre-aggregation to have reasonable query performances. When a query need to scan through large amount of data, there is a theoretical limit on time needed to complete based on the compute resources. All a perfect query engine could do is to get close to that limit. Allowing for real time data loading invariably degrades the ability the query engine could optimize compute. For example, if we want data to show up more quickly, we could build more but smaller files. A large number of small file leads to query slow down. We could update the existing files, but the query engine have to swap out those files. We could stream the realtime data points into the query engine, but those data points would have to be limited otherwise the query engine could spent all of its compute cycle acting as a stream processor. It does not mean that we cannot improve on the existing solutions to have both better realtime data and query. One promising data storage engine is Kudu ( kud20 ). It looks like that it has good integration with some query engines (e.g. Presto) and processing framework (e.g. Flink). However, none of the query engine that connects to Kudu are designed for realtime. In a way, Kudu could not be a storage engine for a best in-class realtime query engine because it is not designed to be memory-mapped and not designed to be collated with compute nodes. Another way to achieve realtime query results is to think about how to optimize some of the more popular batch query engines. Take Dremio as an example. If it is computing on Parquet on S3, there is already too much time spent on network and converting data into Arrow in-memory representation. A more realtime \"Dremio\" can get around those two bottlenecks. It needs to introduce stickness between data and compute nodes. It would need to use a file format that could be memory map onto Arrow layout, e.g. Arrow IPC file format ( arr20 ). Given the tradeoffs between storage engine and realtime query engines, the best strategy to deliver realtime queries is probably pre-computation, building aggregated tables that comes as close to your query patterns as possible.","tags":"misc, , database","url":"/2020-02-03-analytics-database","loc":"/2020-02-03-analytics-database"},{"title":"Streaming Data Platform","text":"I have been building streaming data platforms in the past decade. Streaming data platform has many benefits ( Jcs13 , Fow06 , Fow17 , Aki15 ), and often, they are the most natural choice to organize data pipelines. I will describe the basic structure of a streaming data platform, technology choices, and common challenges. I am only writing down short opinions to explain my personal preferences. Please refer to original documentations for more details. A streaming data platform has two basic parts: data pipe and processor. Data pipe could be called message queue, event bus, distributed log, etc. It is the part of the system that stores the event data. Processors sit between different instances of the data pipe, connecting the pipes to form the topology of the streaming platform. Data Pipe The data pipe has to be rock solid. Any issues with this component cause the whole system to alert, and potentially lead to irreversible data losses. In my experiences, interacting with the data pipe's APIs are straight forward, the most common challenges centered on cluster administrations. Here are some questions to ask: How do I maintain configurations of my cluster? How do I scale up and down the cluster? How do I upgrade my cluster? How do I migrate my cluster to a new environment? How do I mitigate data center failures? Here are some common technology choices: Kafka: It is by far the most popular and most mature solution. Pulsar: It could be an alternative. One of the most compelling features is the existence of a Presto connector that works with the Pulsar's storage backend. One can set up Presto to query the streaming data directly ( pul20 ). Kinesis: It is a managed solution. It should be noted that once the clusters and usage grow. Data center failures will become a legitimate concern. Federation is one way to add flexibility to cluster administrations. For example, see FD19 . However, it is an effort to add that capability, likely requiring a whole team to build and maintain that feature. Deploying data pipe clusters on Kubernetes could have many operational benefits. It could unify data-heavy clusters configurations and application deployments, reducing the complexity of the team's overall devops process. The operator approach on k8s is increasingly popular. I have good experiences using ban20 for medium size Kafka clusters. But the hesitation on going all-in with k8s Kafka is about administration, again! One would have to think through the potential possibilities of cluster migration and cluster federation. K8s operators might not be flexible enough. Processor The core of a streaming data platform is data processors. Processors roughly fall into three categories: source, internal processor, and sink. The hardest processor features to design properly are exactly once processing, state management, and system level monitoring. Achieving exactly once is hard because the applications have to be able to recover from a variety of failure scenarios. Kafka features such as idempotent producer and transaction have made building exactly-once semantics a lot easier (see GJM+19 ). When processors have states, the offset and state management would have to work together. Usually, the states have to be backed up whenever the offset is committed. There are a few approaches that I would go about building these processors in Kafka: Rely on Kafka primitives Build on top of Kafka stream, see kaf20b Build on top of Flink; see fli20a and fli20b Build on top of Spark Stream, see spa20 Monitoring is often overlooked, and it is hard to generalize. The challenge is to be able to have a watermark system in place to inform the progress of the whole processing pipeline. This requirement varies from product to product. For example, a streaming system have stock ticker prices as input events. There are 5 processors, and the last processor output a real time signal about buy or sell. The monitoring questions we would ask could be: How timely is this signal? The definition of \"timely\" could vary a lot depending on how the signal is calculated. Watermark is a generic term to refer to whatever information that is needed to allow downstream applications to make timely decisions. In many cases, this information cannot be part burned into the messages itself. Examples of such information is the failure rates of internal processors, number of unprocessed events, late arrival events, etc. I have not seen any framework that addresses this need. One common pattern to building a system monitor is using yet another data streaming pipeline. This monitoring pipeline has to be dead simple and its failures be independent from what it is monitoring. Source Processor A ource processor could receive external traffic or process data from an object store (S3), and then write the data into the pipe. Source processor is different from internal and sink processor because it cannot use the offset marker in the data pipe to keep track of the progress to accomplish exactly-once-semantics. For example, if the data pipe is Kafka. The hard problem of periodic batch job producing into Kafka is defining job uniqueness and how to recover from job failures. The hard problem of edge servers persisting event into Kafka is coordinating the timing of Kafka receipt acknowledgement and responding to client requests. Sink Processor A sink processor takes data from the data pipe and load them into a storage engine. This component is not difficult to write, but it invariably feels like a painful step in building a data platform. It is partly because it is a common feature that every system needs, and yet we have to manually biuld this component from scratch. It is also partly because we have to write this component many times to target different storage engines based on how the data is used later on. In my experiences, this component is usually handcrafted on top of the same framework as the other internal processors. For example, if other internal processors use Kafka primitives directly, the sinks are built similarly. If the platform already uses Flink, the sinks are written using Flink. I would definitely hope to see a mature solution targeting the most common storage engines. Kafka connect is a possibility ( kaf20a ). It is tied into the Confluent platform, and it is not sufficiently flexible to be composable, and cannot be easily integrated into custom-built processors. Materializing the Data Pipe One of the pain points of a streaming platform is getting insights on in-flight data. One common way is to build many sink processors targeting intermediate stages, loading the data into an analytics database. See my post on analytics databases. This approach is tedious and lead to more applications to maintain. Another way is write a processor that exposes materialized views. In the past, we have to write the data models and processing logic for these views, but in recent years, we are lucky to have these frameworks to help us to write this component. ksqldb Materialize Additionally, this strategy is a great alternative to using a realtime query engine to serve user-facing queries.","tags":"misc, , database, streaming","url":"/2020-02-01-data-streaming","loc":"/2020-02-01-data-streaming"},{"title":"Long Time No Java","text":"I have not written Java in more than three years. I started working on Java applications earlier this year. This post documents some of the Java quirks, tooling, and best practices that caught my eyes in the first 3 months of meeting Java again. It is intended to be useful for others (and my future self) who experience some gap years with Java. The items and references mentioned here are not comprehensive. They are meant to be entry points for the interested readers. I might write detailed posts about specific topics in the future. Java Quirks Effective Java ( Blo18 ). This is the best refresher. After many years, this is still the best book on the best practices of Java programming. This updated version offers guidelines on the functional features on Lamda and Streams. This had been a must-read, and it still is. Pseudo pass by value. I could not choose the exact semantics of parameters passing, e.g. by value, by reference, and by pointer, as I would in c++ . The mode of parameter passing is fixed in Java. It is technically always pass by value , but that is confusing terminology. For object parameters, it is effectively pass by value of the pointer. If the variable is reassigned, the original value is unaffected. If the object's members are reassigned, the original value are affected. For primitive types, it is pass by value. If the variable is reassigned, the original value is unaffected. Check Exceptions. See che14 , Blo18 I still do not have a strong opinion on this feature. In many code bases, it feels pretty random on how folks are passing up and down the chain of checked exception. In rare cases when I am writing a small library, it makes sense to indicate expected issues through checked exception. This features feels awkward. Everyone has an opinion on it in code reviews and uses up a lot of back and forths. I am not convinced one way or the other. I recommend minimize its usage as much as possible. I would not object to a style guide and linting tool that warn against all checked exceptions. Prefer protobuf or thrift over Java serialization. Prefer StringBuilder. StringBuffer is synchronized. StringBuilder is not. Use file utilities from Guava or Apache commons JVM exiting could be tricky. Java class path ordering matters AtomicInteger is lock free, using compare-and-swap ( CAS ) hardware instruction in x86. synchronize is lock based. However, synchronize only uses CAS and work similarly as a lock-free algorithm when there is no contention. If contended, it uses spin locking and then fall back to system calls (e.g. futex). See Tho11 , tho11 , Dic06 , Goe04 , ato14 The utility in Concurrent Collections makes extensive use of lock free algorithms Java non-blocking IO is not the same as Asynchronous IO . AIO is confusingly termed NIO.2 . See disambiguiation and this post . Java has good support for memory-mapped file IO Java Concurrency in Practice ( Goe06 ). Going through this book, even just the table of contents, is a good use of time to get a good download on most of the core concurrent concepts in Java. This does not include actor model made popular by Scala and Akka. Library Choices Google Guava : Its collection idioms are still the best. Jackson for json Http Client Library: If I am developing in Java 11, I would go with the client that comes with the standard library. In older versions, I would choose one of these three: Jersey : My top pick. Google Http Java Client Apache Http Client OK Http HTTP Server Library. I had never been fully on board with any of HTTP REST web frameworks in the past. As of the summer of 2019, my opinion has not changed. I would like to see a lightweight framework that are easy to get started, logically compact, and easy to upgrade. Easy to get started. Spring Boot and DropWizard are good in this regard. Opinionated and hardened designs on its API and usage. A minimum set of well-chosen dependencies. The library maintainers need to be disciplined to keep the dependency graph small. Existing frameworks are hard to upgrade due to conflicting dependencies issues. Make stuffs optional: Features such ORM , Swagger, and Metrics should be are extensions. If I have to create a lightweight REST API server, I would base on Jersey . See an example ). It is worth mentioning a few REST alternatives: GraphQL, GRPC , and Twirp. Yourkit is a decent profiling tool. Dependency The diamond dependency problem will show up one way or the other. Regardless if it is a monorepo or multi-repo setup, dependency issues cause significant headaches and time spent on trouble shooting. The problem will show up unexpectedly when there are attempts to upgrade Java version, introduce a new computing framework, use a new build tool, upgrade old libraries, or add features to legacy applications. The worst offenders are also the most popular libraries and frameworks: Guava, Jackson, Jetty, Jersey, Dropwizard, Spring, etc. Issues on libraries such Guava or Jackson are usually easy to resolve. Issues caused by Jetty and Jersey are still manageable. When it comes to Spring, it requires a healthy dosage of mental fortitude as much as debugging experiences. It speaks volume to the importance of choosing technologies with a long view. The most common strategy is to specify a fixed version of the conflicting library. However, that is easier said than done. In all likely scenarios, the dependency graph is generated through walking down transitive dependencies. There are likely too many conflicting libraries to manually inspect by hand. The hard part is to identify a smallest set that works for the application in question. It is an art. Tooling For dependency management, I would only consider Maven and Ivy . Both work well enough, and most build tools will be able to work pull and push dependencies from both type of repositories. However, each build tool has a strong preference over one or the other. Maven : This is by far the most popular build tool. It has been around for a long time. It is the simplest to use. It works out of the box. It is designed to work for Java projects. I would not try to fit this tool to work with any other languages. Gradle : It is a general purpose build tool. I do not have experiences using this for a monorepo including multiple languages. I would choose Bazel instead if that is the goal. Ant : It is a general purpose build tool. Make : Good luck! If it is a toy project with multiple languages, it is fun to use make. It would not be practical for any production system. Bazel : With the maturing of the Bazel rule on using Maven artifacts , I am choosing Bazel for both small and large projects for java-only as well as monorepo supporting multiple languages. I will write a post and example on how I maintain my bazel-based playground monorepo supporting go, python, c++, java, scala, and javascript. Application Setup Spring Framework: NO ! I would avoid using Spring for dependency injection, configuration, bean life cycle, task scheduling, datasources mapping, etc. No compile time checks. Misconfigurations show up at runtime. It is hard to debug. A bloated library. It always causes unexpected dependency problems. It always did, and always will. Hard to test components Too flexible and hard to enforce usage guidelines. Invariably, bad patterns will show up and it will be a losing battle maintain sanity. Dependency Injection: My current vote is a no. I would architect my applications to be sufficiently small and wire them by hand. If I have to use one because for so many other reasons, the application has to be monolithically large and complex, I would go with Dagger. The history of major dependency injection frameworks in Java is a starting point to learn more about Spring, Guice, and Dagger. Configuration: There are three common ways to store configurations: command line args, config files, and environment variables. There are good reasons to use any of the three. A design principle like 12 factor might or might not work well for your CI / CD and production environments. But in terms of simplicity of user experiences, I would say using only command line args is the simplest, followed by config files, and then environment variables. I prefer staying with the simplest model until it is necessary to add complexity. Style I prefer to choose a style and setup a formatter. Debating how to line break while reviewing a feature release is counter-productive, but I totally get it that there are indents and long lines could lead to compulsively typing \"ugly\". I would pick Google java style guide and integrates the formatter into the build tool. The formatter could be run automatically. go-fmt says it the best. Formatted code is: easier to write : never worry about minor formatting concerns while hacking away, easier to read : when all code looks the same you need not mentally convert others' formatting style into something you can understand. easier to maintain : mechanical changes to the source don't cause unrelated changes to the file's formatting; diffs show only the real changes. uncontroversial : never have a debate about spacing or brace position ever again! Fun reads in Java zer16 , PN08 Jcs13 Sor13 , Java memory usage in docker","tags":"misc, , java","url":"/2019-06-01-long-time-no-java","loc":"/2019-06-01-long-time-no-java"},{"title":"Quote of the Month","text":"With consistency a great soul has simply nothing to do. Speak what you think today in words as hard as cannon balls, and tomorrow speak what tomorrow thinks in hard words again, though it contradicts everything you said today. - R.W. Emerson","tags":"misc, , quote","url":"/2019-04-13-quote-of-the-month","loc":"/2019-04-13-quote-of-the-month"},{"title":"Learning to Surf in San Francisco Area","text":"I started surfing when I was 30 years old. I missed out on the experiences of having my twenties to travel the world to find perfect waves and enjoy indescribable surfing-outs at remote locations. Still, I wouldn't change a thing. I appreciated the learning process more than an averaged surfer. My first experience was a summer roadtrip on the central coast in California. A few friends and I rented some boards and drove around to find waves. None of us knew what to look for. We ended up surfing a well-known beginner's surfing beach, a nice beachbreak with no actual lineups, and a white water only beachbreak at a state park with no one else but the four of us. Surfing at the state park was one of my favorite sunset surf sessions to this day. Over the years, I have written many tutorial emails to friends who want to give surfing a try. Here is tutorial post I put together for the next surfing learner. It is the introduction guide that I wish that I had on my hand when I first started. Where to Surf Surfing learners should go to well-known beginner surf spots most of the time. Within 90 minutes drive from San Francisco, the best bets are Linda Mar, Bolinas, Princenton Jetty, Cowell, and Pleasure Point. When the swells are small, one could try Ocean Beach, Rockaway, Waddell, Davenport Landing, and low tide Middle Peak at Steamer Lane. My favorite combo when I was learning was Linda Mar and Ocean Beach. Between those two spots, I could find rideable waves 350 days a year. Cycling through these spots should give you plenty of water time, getting familiar with equipments, observing the good and the bad parts of surf culture, exploring the California coast, and respecting the ocean. Once you get enough experiences to level up to be an intermediate surfer. I would recommend the classic book Surfing California . Our land has changed, there are way more people in and out of the water, but waves still break more or less the same as they did 100s of years ago. Gears A surfer learner should start on a stable surfboard that has plenty of volume . It usually means a beat up longboard or a foam board. The board of choice should be at least 8 ft long. My personal favorite is the trusty Costco Wavestorm. I started on that board, and I still love surfing it on closeout days. Taking out my waterlogged wavestorm on a summer day to catch waves on some uncrowded sand banks in San Francisco is one of the purest surfing experiences regardless of skill levels. Getting a quality wetsuit and booties will keep the coldest person feeling cozy in the coldest winter months. I run as cold as anyone I have met in my life. My hand and feet would go numb walking around in the summer months in San Francisco after the sun goes down. They are also going numb as I am typing this blog in my home office. Yes, the heat is on. I had always thought that this part of the coast was too cold for me to spend more than 15 minutes swimming. Putting on a quality wetsuit changed all of that. They work better than advertised. The water temperature could go down to the high 40s in winter. I wear a 5/4 with a hood. Many people wear 4/3. Here is a little-known fact: the thickness matters less than the dollar amount you spend and its wear and tear. In the summer, the water temperature is better but still cold. Any 4/3 wetsuit, even a cheap one or a few-years-old wetsuit, is enough to keep me warm and happy all session long. Surf Forecast Surfline is the standard surf forecast tool every surfer uses. Magicseaweed is a good alternative because its free version offers a week ahead forecast while Surfline's free version is only one day ahead. Cross referencing information from a wind specific forecast site, such as Windy , is helpful for a more nuanced forecaster. The basic concepts governing wave qualities are wind direction, wind speed, wind gust, swell direction, swell height, swell period, tide, the specifics of the breaks (bathymetry). The only way to reliably predict quality surfs is by accumulating countless experiences in analyzing onscreen metrics and seeing surf conditions in person. Every break works differently. It is not hard to acquire that pattern recognition for the breaks you visit often. Surfline's overall indicator \"epic/good/fair/poor\" is not accurate often enough. Instead, I use the raw numbers on wind, swell, and tide to make my own forecast. Surfline could easily have a better prediction model if they collect user feedback and implement a simple nonlinear model. It should work really well for the popular breaks. Catching Waves and Popping Up In the beginning, everyone wants catch a wave, pop up, and ride away as one with the ocean energy. I would advise perseverance and enjoy the process no matter what. I learned everything through watching youtube, from surfing on white water to catching small unbroken waves and enjoying clean 4-6 ft waves at Ocean Beach. You can take classes, get help with surf coaches, ask friends, or just go out surfing a lot. You are going to improve as long as you are having fun in the water. I am going to list some tutorial videos to get you started. If you are serious about learning, these will just be your starting points. You will probably watch a thousand hours worth of surf contents. It will help your surfing. Having a mental picture of what your body should be doing is likely to allow your body to develop those muscle memories. Paddling Tutorial 1 Paddling Tutorial 2 Paddling Tutorial 3 Paddling Out on a Longboard Surf Simply Tutorials: Catching Unbroken Waves Surf Simply Tutorials: Paddling To Spot X Surf with Amigas: How to Popup Here is a trick speeding up the learning of the dreaded popup: practice on a yoga mat. When I was learning in the first 6 months, I calculated that every 2 hours I spent on the water, I have about 20 popup attempts. I was always messing up my popups. I started doing 100 popups a day when I was on a trip not anywhere close to the ocean. In that few weeks, I ended up with more popup attempts than I could have accumulated in a year's worth of water time. I actually came back from the trip popping up too fast and had to learn to slow down to better time my popup with the breaking waves. Gaining Proficiency Moving beyond the absolute beginner stage, the next best thing is learning the vocabulary of what you don't know. I would recommend watching the 110% surfing video series cover to cover. You could easily get into a youtube rabbit hole to get sufficient exposures. Here are some links to prime your recommendation engine: How the world's best surfers pop up Shortboard Pop Up Take Off & Heavy Drop (goofy version) Catching More Waves Bottom Turn I have been following these basics to reach my personal surfing goals. Get into the water often Watch surf videos and visualize Apply fitness exercises to increase overall flexibility, upper body strength, core stability, and injury prevention Surfing Clips, Reads, and History Surfing media has a lot of hidden gems, especially if you are new to surfing. There are personalities, adventures, innovations, and everything in between. Going through the history of surfing is a joy in and of itself. I probably spend just as much time surfing the internet on surf culture. A must read if you are surfing in SF , an excerpt from Barbarian Days Surfing with Sartre Great Highway Trailer Nat Young at SF OB Old Surf Movies: Fort Point, 1972-1975 The Weirdest and Most Wonderful Waves Momentum: Under the Influence Shane Dorian's first session at Mavz John John Florence vs Kelly Slater ‘14 Tahiti Rob Machado and Sig Zane Larry Bertlemann Fun Waves and A Really Good Song Blackboy from Batu Karas Gerry Lopez discovered Uluwatu Surfing in Cold Places","tags":"misc, , water, surfing","url":"/2019-04-01-learning-to-surf-sf","loc":"/2019-04-01-learning-to-surf-sf"}]};