Blogs
Building a High-Performance Synthetic Image Generation Pipeline: A Deep Dive
2026-02-12
How we built a scalable image generation system that creates millions of aesthetic quote images for training vision-language models
NucleusAI Migrated 1.5B Objects from S3 to GCS in under 96 hours
2026-02-12
Most migration writeups optimize for copy throughput. Ours optimized for dataset correctness under transformation. While moving the bytes, we rewrote the dataset’s metadata contract, validated payloads, and emitted replayable failure ledgers without turning the operation into weeks of manual tail-chasing.
Scalable Web Scraping at Scale: A Serverless Lambda Architecture
2026-02-11
Scraping millions of URLs efficiently while avoiding rate limits and detection remains a significant engineering challenge. This article details our production-grade serverless web scraping system, which uses AWS Lambda to process thousands of URLs concurrently while maintaining reliability and stealth.
How NucleusAI Curated a 1B Image Dataset for Generative Vision Models
2026-02-03
Building an image model is only partly a modeling problem. The other part is a data engineering problem disguised as a plumbing problem. This article covers how we curated a ~1B image dataset.
mHC-Triton: Building a 6× Faster Kernel for DeepSeek's Hyper-Connections
2026-01-28
A deep dive into implementing Manifold-Constrained Hyper-Connections with fused Triton kernels—achieving 6.2× faster training and 1.3× memory savings.