Infrastructure

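The infrastructure is composed of the following main components: a Kubernetes (k3s) cluster that runs the builds, S3 object storage for the binaries, and a CDN for delivery.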

Because the resource needs of this project vary widely, running on Kubernetes allows efficient use of (shared) resources.

k3s

Building efficiently for different architectures requires access to native servers for those architectures. While most cloud providers make this possible, Kubernetes simplifies the process by scheduling specific builds onto designated nodes. This ensures that each build runs on hardware of the matching architecture, which benefits both performance and compatibility.
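As a minimal sketch of how a build could be pinned to nodes of one architecture, the Python snippet below assembles a Pod manifest with a nodeSelector on the well-known node label kubernetes.io/arch. The function name, pod name, and image are hypothetical and not taken from this project's actual configuration.

```python
import json


def build_pod_manifest(name: str, arch: str, image: str) -> dict:
    """Return a minimal Pod manifest that only schedules onto nodes of `arch`."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # kubernetes.io/arch is a well-known label that the kubelet
            # sets on every node (for example "amd64" or "arm64").
            "nodeSelector": {"kubernetes.io/arch": arch},
            "restartPolicy": "Never",
            "containers": [{"name": "build", "image": image}],
        },
    }


if __name__ == "__main__":
    # Hypothetical pod name and image, for illustration only.
    manifest = build_pod_manifest("build-arm64", "arm64", "example.org/builder:latest")
    print(json.dumps(manifest, indent=2))
```

The resulting JSON can be passed to kubectl apply -f, since Kubernetes accepts manifests in JSON as well as YAML.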

Memory Requirements

Because the memory requirements of individual packages differ widely, ranging from a few hundred MB to about 12 GB, they must be reflected in the resource requirements of each pod. Using requests.memory of 5Gi and limits.memory of 14Gi has proven to work reliably for both scheduling and individual resource needs.
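The values above could be expressed in a container spec roughly as follows. This is a sketch only: the helper names and the image are hypothetical, and the small parser handles just the Gi suffix. In Kubernetes, the request is what the scheduler reserves on a node, while the limit is the ceiling above which the container is killed.

```python
def gi(quantity: str) -> int:
    """Parse a quantity such as '5Gi' into whole gibibytes (Gi suffix only)."""
    if not quantity.endswith("Gi"):
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(quantity[:-2])


def build_container(image: str, request: str = "5Gi", limit: str = "14Gi") -> dict:
    """Return a container spec with a memory request and a memory limit."""
    if gi(request) > gi(limit):
        raise ValueError("memory request must not exceed the memory limit")
    return {
        "name": "build",
        "image": image,  # hypothetical image name supplied by the caller
        "resources": {
            "requests": {"memory": request},  # used for scheduling decisions
            "limits": {"memory": limit},      # hard ceiling for the container
        },
    }


if __name__ == "__main__":
    print(build_container("example.org/builder:latest"))
```

Keeping the request well below the limit lets many small builds share a node while still leaving headroom for the occasional package that needs close to 12 GB.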

Matrix builds

While daily package updates can be handled with a single process per OS/version, building binaries for all CRAN packages requires a different orchestration strategy. On average, each CRAN package has six versions (calculated by dividing the total number of binaries by the product of the number of OS/versions built and the number of unique packages), making some level of parallelization necessary.
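The average described above can be written as a one-line calculation. The input numbers in the example are hypothetical and chosen only so that the result comes out at six; they are not the project's real counts.

```python
def average_versions_per_package(
    total_binaries: int, os_versions: int, unique_packages: int
) -> float:
    """Average number of versions per package across all built OS/versions."""
    return total_binaries / (os_versions * unique_packages)


if __name__ == "__main__":
    # Hypothetical figures: 1,200,000 binaries, 10 OS/versions, 20,000 packages.
    print(average_versions_per_package(1_200_000, 10, 20_000))
```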

Initially, builds were parallelized at the level of package versions. However, this approach led to occasional conflicts when dependencies were installed into a shared package cache. It also made memory requirements within workflows unpredictable: depending on the number of parallel workers, some packages pushed memory usage beyond 30 GB. These spikes not only caused individual processes to crash but also demanded significantly higher overall resource limits.

A more robust solution was found by processing individual packages sequentially within each matrix job.

To build all versions of all packages, CRAN packages are divided into subsets, each comprising 1/10 or fewer of the total packages, and these subsets are processed in parallel. The total time required depends on factors such as the distribution (distributions with newer C compilers, such as Alpine, tend to be faster) and the number of parallel workers. A full build typically takes anywhere from a few days to two weeks.
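The splitting strategy can be sketched as follows: packages are cut into contiguous subsets of roughly one tenth each, and every matrix job then works through its subset one package at a time. The function names and the build callback are hypothetical, and the chunk size is rounded up, so a subset can be marginally larger than exactly 1/10 when the total is not divisible by ten.

```python
import math
from typing import Callable, List


def split_into_subsets(packages: List[str], n_subsets: int = 10) -> List[List[str]]:
    """Split packages into at most n_subsets contiguous chunks of similar size."""
    size = max(1, math.ceil(len(packages) / n_subsets))
    return [packages[i:i + size] for i in range(0, len(packages), size)]


def run_matrix_job(subset: List[str], build: Callable[[str], None]) -> List[str]:
    """Build each package in the subset sequentially; return the failed ones."""
    failed = []
    for package in subset:
        try:
            build(package)
        except Exception:
            # One failing package should not abort the rest of the subset.
            failed.append(package)
    return failed


if __name__ == "__main__":
    packages = [f"pkg{i}" for i in range(25)]  # hypothetical package names
    for index, subset in enumerate(split_into_subsets(packages)):
        print(index, len(subset), run_matrix_job(subset, build=lambda p: None))
```

In a real setup each subset would map to one parallel matrix job, while the loop inside run_matrix_job keeps peak memory bounded by a single package build at a time.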

Storage: S3

Binaries need to be stored somewhere, and what better option than S3? S3 is significantly more cost-effective than traditional cloud disk storage and offers the added benefit of being accessible via a public API. Beyond AWS, the original provider and inventor of S3, there are numerous S3-compatible alternatives with lower storage prices and lower transfer costs.

The timing was perfect: Hetzner introduced its own S3-compatible object storage just as this project’s build processing started. This solution brings multiple advantages: lower overall storage costs, free internal traffic between Hetzner servers and their S3 storage, and the proximity of storage to the build servers, which minimizes upload latency.

Note

While uploading packages to S3 is not complicated, there hasn’t been any way to create the required PACKAGES index files for binaries stored there. This is why forked versions of {cranlike} and {desc} have been created.
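For orientation, a PACKAGES index is a plain-text file in DCF format: one stanza of "Field: value" lines per package, with stanzas separated by blank lines. The sketch below only illustrates that layout with hypothetical records; it is not how the forked {cranlike} and {desc} packages work internally, and it leaves out the S3 listing and upload steps entirely.

```python
from typing import Dict, List


def format_packages_index(records: List[Dict[str, str]]) -> str:
    """Render package metadata records as DCF stanzas separated by blank lines."""
    stanzas = []
    for record in records:
        stanzas.append("\n".join(f"{field}: {value}" for field, value in record.items()))
    return "\n\n".join(stanzas) + "\n"


if __name__ == "__main__":
    # Hypothetical package records, for illustration only.
    records = [
        {"Package": "examplepkg", "Version": "1.0.0", "Depends": "R (>= 4.0)"},
        {"Package": "otherpkg", "Version": "0.2.1"},
    ]
    print(format_packages_index(records), end="")
```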

Content Delivery Network (CDN)

Storing binaries in S3 works well for distribution, but it’s not inherently very fast. Adding a CDN in front of S3 enables caching and allows assets to be distributed via servers located in various regions worldwide. This significantly reduces download latency, making downloads feel much faster.

All packages are delivered through a CDN, which includes three dedicated static caches strategically placed in Germany, the USA, and Asia.

With a CDN in place, downloads are optimized to feel “fast” from virtually anywhere, with only minor variations depending on the user’s location.

The CDN determines when an asset is added to its permanent cache and how often it is revalidated against the S3 source. Since package binaries are one-time builds that typically remain unchanged unless a forced rebuild occurs, relying heavily on a permanent cache is highly efficient in this context.
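One way to think about such a caching policy is as a mapping from object key to a Cache-Control header. The sketch below is an assumption rather than this project's actual CDN configuration: it treats built binaries as effectively immutable and gives the PACKAGES index files a short lifetime so that newly built packages become visible quickly. The function name, the TTL values, and the example keys are all hypothetical.

```python
import posixpath

# Hypothetical lifetimes: one year for binaries, five minutes for index files.
BINARY_MAX_AGE = 365 * 24 * 60 * 60
INDEX_MAX_AGE = 5 * 60


def cache_control_for(key: str) -> str:
    """Return a Cache-Control header value for an object key in the bucket."""
    name = posixpath.basename(key)
    if name.startswith("PACKAGES"):
        # Index files change whenever a package is added, so revalidate often.
        return f"public, max-age={INDEX_MAX_AGE}, must-revalidate"
    # Binaries are one-time builds and can be cached long term.
    return f"public, max-age={BINARY_MAX_AGE}, immutable"


if __name__ == "__main__":
    for key in ("some/path/PACKAGES.gz", "some/path/examplepkg_1.0.0.tar.gz"):
        print(key, "->", cache_control_for(key))
```

A forced rebuild would overwrite a binary under the same key, so under this assumed policy the cached copy would need to be purged from the CDN explicitly.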
