In this post we'll go over a bunch of the optimizations done to this code to make it usable and performant. There are still a few outstanding optimizations that should be done so I might do another post like this in the future.
- Future Optimizations
- FRP Optimization
- Memory Leak
- Reducing Churn
- Retrieving Colors
- Tighter Code
The Model is responsible for reading, processing, and holding the WAV data used by the application. Each of these aspects has had optimizations applied to it, which we'll go over in turn here.
WAV files contain a sequence of (multi-byte) audio samples, each of which has to be read from the disk. In the inital implementation of file reading in this application this was performed one sample at a time using a plain file handle. Beacuse of the disparity in speed between processing a sample in the processor and fetching it from disk, this process was sloooow -- disk fetches are a couple of orders of magnitude slower, even with an SSD.
To decode samples from the bytes of a file, I'm using Byte Strings. After a bit of research I found that if I were to read the file using Lazy Byte Strings, using
hGetContents instead of a file handle, it would fetch the data from the disk into RAM/cache automatically in the background in chunks of 32k at a time, which helped speed things up.
Furthermore, by decoding multiple samples from the Byte String at a time, instead of just one, things are sped up even more (which I believe is due to what data is available in the cache). Currently I'm decoding 512 samples' worth of data at a time during reading (having roughly tuned to that number) and as a consequence the program can read and decode 125mb from disk in ~2s on my relatively new system.
After reading the WAV data from disk it's stored in memory for processing and, after that, for playback. Because of the size of the files I expect people to want to edit in this application -- on the order of hundreds of megabytes -- storage efficiency was a concern.
Lists are a go-to data-structure in Haskell for storing sequences of data, but they are not that efficient and also don't allow for linear access to the data they contain, which is needed for this application. My attention therefore turned to the more efficient Vector data-structure, which does provide linear access.
Vectors come in a surprising number of flavors in Haskell, so it took me a bit of time and research to figure out which one I wanted to use. The regular Vector type stores boxed data, meaning that each item is actually a pointer to the data for that entry, which is stored in some other location. Keeping pointers to samples along with the samples themselves didn't seem like the most efficent use of space, so I kept looking.
There are three types of Vector that store unboxed data -- the raw data itself rather than pointers to it: Primitive Vectors, Unboxed Vectors, and Storable Vectors. The SDL2 Haskell wrapper library uses Storable Vectors for passing audio data to the sound card, so I decided to go with this flavor of Vector since I thought it would be handy to pass loaded WAV data directly to the audio system. (Storable data is pinned, meaning it won't be moved around by the garbage collector so that it can be passed back and forth to other code via the FFI.) I didn't actually end up passing the data directly in this way due to various reasons, so I might switch to one of the other types of unboxed Vectors in the future.
When constructing Vectors there are mutable versions of each, which let you write data to them (within a monad) and then freeze them into the non-mutable version when you're done; this is how I'm creating my Vector of WAV data. Freezing normally involves a making a copy of the mutable version, which, given the amount of data that would be loaded into memory in this program, seemed inadvisable, so I opted to use the unsafe version of
freeze, which doesn't do the copy but rather freezes in-place.
Not Keeping Data Around
In addition to storing the WAV data in memory I was initally keeping a copy of it in decibel form as well, since I was using that multiple times during processing, and generating it is somewehat expensive. However, since the decibel form is only used during the loading process, it was a waste to keep it around longer than that, so I stopped doing that and just generate it as needed now. One of the use-cases I'd had for it has since been refactored away too, so it only needs to be generated once now meaning there's no performance loss anymore.
There are a couple of processing steps the WAV data undergoes after it's loaded: segementation and visuals generation. The latter is something I'll cover in a future article as I want to go over the algorithm and data-structure used for displaying a waveform in detail, so I'll only cover the segmentation optimization here.
Segmentation is done to separate clips of audio from gaps of silence for editing; this is done by comparing each WAV sample to a noise-floor threshold that the user configures, grouping samples above the threshold into clips and samples below it into gaps.
The threshold used for comparison is specified as a decibel level relative to "full scale" (or dBFS), so I was transforming each WAV sample into its dBFS level to compare the threshold to. As mentioned above, this transformation is somewhat expensive (especially when done for so many samples), so I ended up refactoring the code to transform the threshold into a sample to compare against the WAV samples directly -- thus there's only one transform as opposed to millions. This, coupled with refactoring the segmentation code to be more efficient, really helped speed up the processing time (and removed one use of the decibel data, as touched on above).
In the future I would like to parallelize the reading, segmentation, and display generation processes to speed up the initial load time even further. I also would like to try reading the file from the disk on-demand during playback, instead of keeping the whole thing in RAM all the time, to minimize memory usage (though at the expense of disk access -- I'll have to test it out and see how well it works).
The Presenter layer takes care of the UI logic, using a state-machine to run the UI FRP code. This application was the first time I'd really used FRP (besides some basic stream-processing in OOP languages) so there were a few iterations of refinement and optimization that this area underwent. There was also a weird memory leak in the state-machine that I will briefly discuss too.
The initial implementation of the FRP logic used the
proc arrow syntax for expressing some of the more complicated FRP logic. Unfortunately, from what I've read, this desugars into less-than-optimal code. A bit of testing with the arrow-qq utility in a branch confirmed this; the slowness could also be felt while using the application on the older systems on which I tested it.
As part of the optimization of this code I sought to move away from the
proc synatx, and to do this I would have to refactor the code using it. After some analysis, I saw that the code using it was mostly tracking the same things, e.g. the mouse's (global) position, just in different components.
Since so much of the code needed the same information, I figured I'd track it all in one place up front and feed that context into the FRP network along with the stream of user input events from the View.
All of these changes resulted in a palpable speed up of the UI logic.
This program has had some memory leaks that I've been trying to nail down for a while. While doing profiling, I noticed that having heap-profiling enabled seemed to make the leaks disappear. I figured that this was due to the leaks being thunks that were being evaluated during heap profiling where normally they wouldn't be.
After some investigation -- and I forget what line of thought led me to this course of action -- I ended up adding some type-signatures to a few of the top-level functions in the Presenter state-machine, and some of the leaks went away.
I wish I could explain this, but I can't.
Out of all the code in this codebase, the View is probably the area that's had the most optimizations applied to it, so we'll touch upon a variety of topics in this section. Profiling has shown that rendering the waveform to the screen is especially intensive, so most of the focus will be there.
Back when the application was immature I sought to isolate the code touching the different external libraries I'm using from the rest of my codebase by establishing interface boundaries. As part of this, I made low-level primitives for use in my code which corresponded to those offered by the SDL2 lib, to isolate the use of that libarary's types to the View. What is now the Presenter was at that point in charge of "rendering" a waveform into these low-level primitives, which were then transformed into the SDL2 equivalents in the View and written to the screen by the library.
This was obviously less than optimal: there was essentially double the memory allocation going on than was necessary because of the use of the intermediate types. After refactoring, a high-level description of the wavefrom is now passed from the Presenter to the View instead of the low-level primitives, and the "rendering" logic, now in the View, directly generates the SDL2 types to write to the screen, eliminating the use of intermediates.
When displaying the waveform, different colors are used for the gaps, clip backgrounds, clip wave content, and cursors, and these can change depending on how the user is interacting with them. The different colors that can be applied to these components are defined in a YAML configuration file, which is read into a nested data-structure when the program loads.
Since the structure is nested, I thought lenses might be a nice way to retrieve the colors from it while writing the components to the display. Unfortunately, the lens machinery appears to be too slow for such a performance-critical context, so I had to change my approach. Now after the nested structure is read from the file it gets mapped to a flat structure containing many, many fields. Retrieval of the colors from these fields is much faster, which resulted in speed up of the rendering code.
As a bonus to this, I can now support multiple versions of the color configuration file, since the nested structures will be mapped to the same flat structure, in case I ever want to change the format but maintain backwards compatibility (which I think I will).
While looking at the profiling output I saw that a few of the calls to the SDL2 library during waveform rendering were taking a surprising proportion of the time. These calls were being run in some loops that iterate over quite a lot of data, so this seemed worth optimizing.
The SDL2 functions have the constraint that the monad they're called in must implement
MonadIO, and some of the code around them in the profiling output made me think that it was the lifing machinery to get to the IO monad that was slowing things down. So, I lifted the enclosing loops with
liftIO to get all the loop code running in the base monad, and that sped things up.
Most of the rest of the optimizations done to the visuals code were just common tricks, e.g. moving the creation of constants out of the loops they're used in to the code before the loop (which is sometimes tough to see when the constant is part of a function that's being run in a loop) and using better algorithms or data-structures.
Selecting the right algorithm/data-structure for rendering the waveform played a large part in optimizing the display code, but as stated above that'll be covered in a future article.