---
layout: post
title: "Google Summer of Code (GSoC) 2024 Wrap-Up: Small Projects"
---
<p>
<img
src="https://developers.google.com/open-source/gsoc/images/gsoc2016-sun-373x373.png"
title="Google Summer of Code" alt="Google Summer of Code"
style="float: right; height: 8em; " />
</p>

We would like to highlight the work of two of our outstanding [Google Summer of Code](https://summerofcode.withgoogle.com/) (GSoC) contributors, [Lawson Woods](https://summerofcode.withgoogle.com/programs/2024/projects/BYYAE9MR) (@ljwoods2) and [Valerij Talagayev](https://summerofcode.withgoogle.com/programs/2024/projects/sfy3kuqc) (@talagayev), who successfully completed their [GSoC projects](https://summerofcode.withgoogle.com/programs/2024/organizations/mdanalysis) with MDAnalysis.

As part of [GSoC's 20th anniversary celebration](https://opensource.googleblog.com/2023/11/google-summer-of-code-2024-celebrating-20th-year.html), Google introduced a new "small" project scope (~90 hour commitment) to its repertoire during the 2024 GSoC program to reduce time barriers to participation. Lawson and Valerij have both proven that though their projects may have been small in scope, their contributions to MDAnalysis during the GSoC period were invaluable.

You can read more about Lawson and Valerij's projects in their blog posts:

* Lawson: [GSoC 2024 - Enhancing the Interoperability and Efficiency of the Zarrtraj MDAKit][blog-lawson]
* Valerij: [GSoC 2024 - 2D Visualization for Small Molecules][blog-valerij]

We have tremendously enjoyed working with Lawson and Valerij and wish them the best in their future contributions to the MDAnalysis community and beyond!

We also thank [Google](https://opensource.google/) for supporting the students and continuing to support MDAnalysis as a GSoC organization since 2016.

@cbouy @hmacdope @IAlibay @jennaswa @richardgowers @xhgchen @yuxuanzhuang (mentors and org admins)

[blog-lawson]: {{ site.baseurl }}{% post_url 2024-08-03-gsoc2024_woods %}
[blog-valerij]: {{ site.baseurl }}{% post_url 2024-08-02-gsoc2024_talagayev %}
---
layout: single
title: "GSoC 2024 - Streaming H5MD trajectories from the cloud"
categories: zarrtraj
---

## What is Zarrtraj?

[Zarrtraj](https://github.com/Becksteinlab/zarrtraj) is an [MDAKit](https://mdakits.mdanalysis.org/about.html) for storing and analyzing trajectories in [MDAnalysis](https://github.com/MDAnalysis/mdanalysis) from the cloud, representing a major milestone towards the [MDAnalysis 3.0 goal of cloud streaming](https://www.mdanalysis.org/2023/10/25/towards_3.0) as well as a proof-of-concept for a new paradigm in the field of molecular dynamics as a whole.
It can stream [H5MD](https://www.nongnu.org/h5md/h5md.html) trajectories from cloud storage
providers including [AWS S3](https://aws.amazon.com/s3/), [Google Cloud](https://cloud.google.com/) buckets, and [Azure Blob](https://azure.microsoft.com/en-us/products/storage/blobs) storage and data lakes.
Using Zarrtraj, anyone can reproduce your analyses or train their machine learning (ML) model on trajectory data
without ever having to download massive files to their disk.

This is possible thanks to the [Zarr](https://github.com/zarr-developers/zarr-python), [fsspec](https://github.com/fsspec/filesystem_spec),
and [kerchunk](https://github.com/fsspec/kerchunk) packages, which have created a foundation for storing and interacting with
large datasets in a uniform way across different storage backends. Interestingly, these projects were initially developed by geoscientists in the [Pangeo project](https://pangeo.io/) to make use of [cloud computing credits from an NSF partnership with cloud providers](https://medium.com/pangeo/pangeo-2-0-2bedf099582d), but they have since been widely adopted across the broader Python community. This project also represents one of the first forays of the molecular dynamics field into the excellent Pangeo ecosystem of tools.

Zarr is especially well-suited to the task of reading cloud-stored files due to its integration with
[dask](https://docs.dask.org/en/stable/) for parallelized reading. This parallelization offsets
the increased IO time of cloud streaming, speeding up common analysis algorithms by up to ~4x compared to sequential analysis.
See the [zarrtraj benchmarks](https://zarrtraj.readthedocs.io/en/latest/benchmarks.html) for more.
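As a generic illustration of why this matters (this is not Zarrtraj's internals; the array shape and file name are made up), a chunked Zarr array can be wrapped as a dask array so that chunk reads and downstream reductions are scheduled in parallel rather than sequentially:

```python
import dask.array as da
import numpy as np
import zarr

# Create a small, chunked Zarr store on local disk to stand in for
# cloud-hosted trajectory data (shape: frames x atoms x xyz).
z = zarr.open("example.zarr", mode="w", shape=(1000, 100, 3),
              chunks=(100, 100, 3), dtype="f4")
z[:] = np.random.random((1000, 100, 3)).astype("f4")

# Wrap the store as a dask array: each Zarr chunk becomes a dask task,
# so the reduction below is computed with parallel chunk reads.
positions = da.from_zarr("example.zarr")
mean_per_frame = positions.mean(axis=(1, 2)).compute()
print(mean_per_frame.shape)  # (1000,)
```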

In this project, we also decided to experiment with storing trajectories directly in Zarr-backed files
using the same specification that H5MD uses, so Zarrtraj can read both `.h5md` and H5MD-formatted `.zarrmd`
files. See [this explanation of the modified format](https://zarrtraj.readthedocs.io/en/latest/zarrmd-file-spec/v0.2.0.html)
to learn more.

While this GSoC project started with the goal of building a new, Zarr-backed trajectory format, we pivoted
to making the existing H5MD format streamable after getting feedback
from the community that supporting widely adopted formats in the MD ecosystem makes code more sustainable and simplifies tool adoption.

The next section is a walkthrough of Zarrtraj's features and usage (also available
[here](https://zarrtraj.readthedocs.io/en/latest/walkthrough.html)).

## How can I use it?

Zarrtraj is currently available via PyPI and Conda Forge.

Pip installation
```bash
pip install zarrtraj
```

Conda installation
```bash
conda install -c conda-forge zarrtraj
```

For more information on installation, see [the installation guide](https://zarrtraj.readthedocs.io/en/latest/installation.html#installation).

This walkthrough will guide you through the process of reading and writing H5MD-formatted trajectories from cloud storage using
AWS S3 as an example. To learn more about reading and writing trajectories from different cloud storage providers,
including Google Cloud and Azure, see the [API documentation](https://zarrtraj.readthedocs.io/en/latest/api.html).

### Reading H5MD trajectories from cloud storage

#### Uploading your H5MD file

First, upload your H5MD trajectories to an AWS S3 bucket. This requires an S3 bucket that is set up and configured for
write access using the credentials stored in "sample_profile". If you've never configured an S3 bucket before, see
[this guide](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html). You can set up a profile to easily manage AWS
credentials using [this VSCode extension](https://marketplace.visualstudio.com/items?itemName=AmazonWebServices.aws-toolkit-vscode).
Here is a sample profile (stored in `~/.aws/credentials`) in which
[the keys belong to a user with read and write permissions for the bucket](https://stackoverflow.com/questions/50802319/create-a-single-iam-user-to-access-only-specific-s3-bucket):

```
[sample_profile]
aws_access_key_id = <key>
aws_secret_access_key = <secret>
```
MDAnalysis can write a trajectory from
[any of its supported formats into H5MD](https://docs.mdanalysis.org/stable/documentation_pages/coordinates/H5MD.html). We
recommend passing the `chunks` kwarg to the MDAnalysis `H5MDWriter` with a value that yields ~8-16 MB chunks of data for the best S3 performance.
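As a rough sketch (the output file name and chunk shape here are illustrative, not prescriptive), writing a local H5MD file with an explicit chunk shape might look like this:

```python
import MDAnalysis as mda
from MDAnalysisTests.datafiles import PSF, DCD

u = mda.Universe(PSF, DCD)

# Chunk shape is (frames, atoms, xyz); tune the frame count so each chunk of
# coordinate data lands in the ~8-16 MB range for your system size.
with mda.Writer(
    "protein.h5md",
    n_atoms=u.trajectory.n_atoms,
    chunks=(100, u.trajectory.n_atoms, 3),
) as w:
    for ts in u.trajectory:
        w.write(u.atoms)
```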
Once written locally, you can upload the trajectory to S3 programmatically:

```python
import os

import boto3
from botocore.exceptions import ClientError

os.environ["AWS_PROFILE"] = "sample_profile"
# This is the AWS region where the bucket is located
os.environ["AWS_REGION"] = "us-west-1"


def upload_h5md_file(bucket_name, file_name):
    """Upload a local H5MD file to the bucket under its basename."""
    s3_client = boto3.client("s3")
    obj_name = os.path.basename(file_name)

    try:
        s3_client.upload_file(file_name, bucket_name, obj_name)
    except ClientError as err:
        print(f"Failed to upload {file_name}: {err}")
        raise


if __name__ == "__main__":
    # Using a test H5MD file from the Zarrtraj repo
    upload_h5md_file(
        "sample-bucket-name", "zarrtraj/data/COORDINATES_SYNTHETIC_H5MD.h5md"
    )
```

You can also upload the H5MD file directly through the AWS web interface by navigating to S3, selecting your bucket, and pressing
"Upload".

#### Reading your H5MD file

After the file is uploaded, you can use the same credentials to stream the file into MDAnalysis:

```python
import zarrtraj
import MDAnalysis as mda
# This sample topology requires installing MDAnalysisTests
from MDAnalysisTests.datafiles import COORDINATES_TOPOLOGY
import os

os.environ["AWS_PROFILE"] = "sample_profile"
os.environ["AWS_REGION"] = "us-west-1"

u = mda.Universe(COORDINATES_TOPOLOGY, "s3://sample-bucket-name/COORDINATES_SYNTHETIC_H5MD.h5md")
for ts in u.trajectory:
pass
```

You can follow this same process for reading `.zarrmd` files, with the added advantage
that Zarrtraj can write `.zarrmd` files directly into an S3 bucket.

### Writing trajectories from MDAnalysis into a zarrmd file in an S3 Bucket

Using the same credentials with read/write access, you can write a trajectory
into your bucket.

You can change the stored precision of floating point values in the file with the optional
`precision` kwarg and pass in any `numcodecs.Codec` compressor with the optional
`compressor` kwarg. See [numcodecs](https://numcodecs.readthedocs.io/en/stable/)
for more on the available compressors.

Chunking is automatically determined for all datasets to be optimized for
cloud storage and is not configurable by the user.
Initial benchmarks show this chunking strategy is effective for disk storage as well.

```python
import zarrtraj
import MDAnalysis as mda
from MDAnalysisTests.datafiles import PSF, DCD
import numcodecs
import os

os.environ["AWS_PROFILE"] = "sample_profile"
os.environ["AWS_REGION"] = "us-west-1"

u = mda.Universe(PSF, DCD)
with mda.Writer(
    "s3://sample-bucket-name/test.zarrmd",
    n_atoms=u.trajectory.n_atoms,
    precision=3,
    compressor=numcodecs.Blosc(cname="zstd", clevel=9),
) as W:
    for ts in u.trajectory:
        W.write(u.atoms)
```

If you have additional questions, please don't hesitate to open a discussion on the [Zarrtraj GitHub](https://github.com/Becksteinlab/zarrtraj).
The [MDAnalysis Discord](https://discord.com/channels/807348386012987462/) is also a
great resource for asking questions and getting involved in MDAnalysis; instructions for [joining the MDAnalysis Discord server](https://www.mdanalysis.org/#participating) can be found on the MDAnalysis website.

## What's next for Zarrtraj?

Zarrtraj is fully operational and ready for use!
However, I'm excited about adding some new features in the future that will
make Zarrtraj more flexible and faster.

### Lazytimeseries

In MDAnalysis, many trajectory readers expose a `timeseries` method for getting access to
coordinate data for a subselection of atoms across a trajectory. This provides
a viable way to sidestep the `Timestep` (eagerly-loaded frame-based) paradigm that
MDAnalysis uses for handling trajectory data. Zarrtraj could implement a
"lazytimeseries" that returns a lazy dask array of a selection of atoms' positions
across the trajectory. Early benchmarks show that analysis based on such a lazy array
can outperform `Timestep`-based analysis.
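For reference, here is a minimal sketch of the existing eager `timeseries` call (using the MDAnalysisTests sample files); a lazy variant would instead return a dask array backed by the underlying Zarr chunks:

```python
import MDAnalysis as mda
from MDAnalysisTests.datafiles import PSF, DCD

u = mda.Universe(PSF, DCD)
backbone = u.select_atoms("backbone")

# Eager API: materializes an in-memory array of shape
# (n_frames, n_selected_atoms, 3) in "fac" (frame, atom, coordinate) order.
positions = u.trajectory.timeseries(backbone, order="fac")
print(positions.shape)
```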

### Asynchronous Reading

The performance impact of network IO could be reduced by creating a multi-threaded `ZARRH5MDReader` that isn't blocked
while analysis code executes. The reader could eagerly load its cache with the next frames
the analysis will need, reducing the impact of network IO on execution time.
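The core idea, sketched generically below (this is not Zarrtraj's actual reader; `read_frame` is a hypothetical callable that fetches one frame over the network), is to keep the next read in flight while the current frame is being analyzed:

```python
from concurrent.futures import ThreadPoolExecutor


def prefetched_frames(read_frame, n_frames):
    """Yield frames in order while the next read runs in a worker thread."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(read_frame, 0)
        for i in range(n_frames):
            frame = future.result()  # wait for the frame we need right now
            if i + 1 < n_frames:
                future = pool.submit(read_frame, i + 1)  # start fetching the next one
            yield frame


# Usage sketch: analysis of frame i overlaps with the fetch of frame i + 1.
# The lambda stands in for a (slow) network read.
for frame in prefetched_frames(lambda i: f"frame-{i}", n_frames=5):
    print(frame)
```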

## Acknowledgements

A big thanks to Google for supporting the [Google Summer of Code program](https://summerofcode.withgoogle.com/) and to the GSoC team for enabling my project.

Thank you to Dr. Hugo MacDermott-Opeskin ([@hmacdope](https://github.com/hmacdope)) and Dr. Yuxuan Zhuang ([@yuxuanzhuang](https://github.com/yuxuanzhuang)) for their mentorship and feedback
throughout this project, and to Dr. Jenna Swarthout Goddard ([@jennaswa](https://github.com/jennaswa)) for supporting the [GSoC program
at MDAnalysis](https://summerofcode.withgoogle.com/organizations/mdanalysis/programs/2024/projects).

I also want to thank Dr. Oliver Beckstein ([@orbeckst](https://github.com/orbeckst)) and Edis Jakupovic ([@edisj](https://github.com/edisj)) for lending their expertise
in H5MD and all things MDAnalysis.

Finally, another thanks to Martin Durant ([@martindurant](https://github.com/martindurant)), author of Kerchunk, who was incredibly helpful in refining and merging
a new feature in his codebase that was necessary for this project to work.

### Citations

Alistair Miles, jakirkham, M Bussonnier, Josh Moore, Dimitri Papadopoulos Orfanos, Davis Bennett, David Stansby, Joe Hamman, James Bourbeau, Andrew Fulton, Gregory Lee, Ryan Abernathey, Norman Rzepka, Zain Patel, Mads R. B. Kristensen, Sanket Verma, Saransh Chopra, Matthew Rocklin, AWA BRANDON AWA, … shikharsg. (2024). zarr-developers/zarr-python: v3.0.0-alpha (v3.0.0-alpha). Zenodo. https://doi.org/10.5281/zenodo.11592827

de Buyl, P., Colberg, P. H., & Höfling, F. (2014). H5MD: A structured, efficient, and portable file format for molecular data. In Computer Physics Communications (Vol. 185, Issue 6, pp. 1546–1553). Elsevier BV. https://doi.org/10.1016/j.cpc.2014.01.018

Gowers, R., Linke, M., Barnoud, J., Reddy, T., Melo, M., Seyler, S., Domański, J., Dotson, D., Buchoux, S., Kenney, I., & Beckstein, O. (2016). MDAnalysis: A Python Package for the Rapid Analysis of Molecular Dynamics Simulations. In Proceedings of the Python in Science Conference. Python in Science Conference. SciPy. https://doi.org/10.25080/majora-629e541a-00e

Jakupovic, E., & Beckstein, O. (2021). MPI-parallel Molecular Dynamics Trajectory Analysis with the H5MD Format in the MDAnalysis Python Package. In Proceedings of the Python in Science Conference. Python in Science Conference. SciPy. https://doi.org/10.25080/majora-1b6fd038-005

Michaud‐Agrawal, N., Denning, E. J., Woolf, T. B., & Beckstein, O. (2011). MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. In Journal of Computational Chemistry (Vol. 32, Issue 10, pp. 2319–2327). Wiley. https://doi.org/10.1002/jcc.21787

## Extras: Upstream code merged during this GSoC project

### MDAnalysis

One feature I needed for testing Zarrtraj was full use of writer kwargs
when aligning a trajectory and writing it out immediately rather than storing it in memory.
This hadn't yet been implemented, and luckily it was a small change,
so I worked with the core developers to merge this PR with the new feature and tests:
- <https://github.com/MDAnalysis/mdanalysis/pull/4565>
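A minimal usage sketch of what this enables, assuming the keyword argument ended up being named `writer_kwargs` (check the PR and the `MDAnalysis.analysis.align.AlignTraj` docs for the exact signature); the output file name and chunk shape are illustrative:

```python
import MDAnalysis as mda
from MDAnalysis.analysis import align
from MDAnalysisTests.datafiles import PSF, DCD

mobile = mda.Universe(PSF, DCD)
reference = mda.Universe(PSF, DCD)

# Align frame-by-frame to the reference and write straight to disk
# (in_memory=False), forwarding chunking options to the H5MD writer.
align.AlignTraj(
    mobile,
    reference,
    filename="aligned.h5md",
    writer_kwargs={"chunks": (100, mobile.atoms.n_atoms, 3)},  # assumed kwarg name
).run()
```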

### MDAKits

Since Zarrtraj is based on the [MDAnalysis MDAKit Cookiecutter](https://github.com/MDAnalysis/cookiecutter-mdakit)
(which is a fantastic tool for getting started in making and distributing Python packages), I was able
to find and fix a few small bugs along the way in my GSoC journey including:
- <https://github.com/MDAnalysis/cookiecutter-mdakit/pull/115>
- <https://github.com/MDAnalysis/cookiecutter-mdakit/pull/133>
- <https://github.com/MDAnalysis/MDAKits/pull/140>

### Kerchunk

[Kerchunk](https://github.com/fsspec/kerchunk) is central to Zarrtraj's ability
to read hdf5 files using Zarr. However, H5MD files (stored in hdf5) have linked
datasets as per the H5MD standard, but Kerchunk did not translate these from
hdf5 to Zarr previously. I was able to work alongside a Kerchunk core developer
to add the ability to translate hdf5 datasets into Zarr along with comprehensive
tests of this new feature.
- <https://github.com/fsspec/kerchunk/pull/463>
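For context, the core Kerchunk workflow looks roughly like the following minimal sketch (`protein.h5md` is a hypothetical local file; in Zarrtraj the same idea is applied to files in object storage): the HDF5 file is scanned once to build a reference set describing its chunk layout, which Zarr can then read through fsspec's `reference://` filesystem without converting the file.

```python
import fsspec
import zarr
from kerchunk.hdf import SingleHdf5ToZarr

url = "protein.h5md"  # hypothetical local H5MD (HDF5) file

# Scan the HDF5 file and build a JSON-serializable set of chunk references.
with fsspec.open(url, "rb") as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# Open the reference set as a read-only Zarr hierarchy.
fs = fsspec.filesystem("reference", fo=refs)
group = zarr.open(fs.get_mapper(""), mode="r")
print(list(group.keys()))
```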

## Extras: Lessons Learned

Here are a bunch of random things I learned while doing this project!

- Any time you need to run code that will take several hours to execute
due to file size or some other factor, create a minimal, quickly-executing
example of the code to work out bugs before running the full thing. You will
save yourself so, so much frustration.
- It is worth investing time into getting a debugging environment properly configured.
If you're hunting down a specific bug, it is worth the 20 minutes it will take to create
a barebones example of the bug instead of trying to hunt it down "in-situ". A lot of the time,
just creating the example will make you realize what was wrong. GH Actions runners sometimes
behave differently than your development machine. [This action](https://github.com/namespacelabs/breakpoint-action)
for SSHing into a runner is FANTASTIC!
- Maintain a "tmp" directory in your locally cloned repos, gitignore it, and use it for testing random ideas
you have or working through bugs. Take the time to give each file in it a descriptive name!
Having these random scripts and ideas all in one place will pay off massively later on.
- Take risks with ideas if you suspect they might result in cleaner and faster code,
even if you're not 100% sure! Experimenting is worth it.
- Don't be afraid to read source code! Sometimes the fastest way to solve a problem
  is to see how someone else solved it, and sometimes the fastest way to learn
  why someone else's code isn't doing what you expected is to read the code rather than
  the docs.
