SYSTEM_ONLINE :: PORTFOLIO_v2.0.0
Swarnavo
Mukherjee
_
CUDARustLLM InferenceGPU HypervisorRing-0 Kernel
SCROLL
// SECTION_01
PROFILE.

IDENTITY_CONFIRMED
Swarnavo Mukherjee
Kolkata, India · techmonk000
SYS_AUTH · v2.0
● ACCESS_GRANTED
HANDLE
techmonk000
CLEARANCE
RING-0
ARCH
x86_64
PRIMARY
Rust / CUDA
THREAT_SCAN
MALWARECLEAN
INTRUSIONBLOCKED
ANOMALY0 FOUND
KERNEL_STATUS
NEBIONRUNNING
CYBER-OSARMED
CUDAsm_89
UID · 0x4E4542494F4E
SESSION · ENCRYPTED
SESSION · ENCRYPTED
swarnavo@nebion:~
$
$
$
$
$
$
$
$
$
$
$
PERFORMANCE_METRICS
0%
False Positive Reduction (Cyber-OS)
0.0×
Concurrent Users (Nebion H100)
0%
Report Time Reduced (Alliance)
0+
Lines of no-std kernel code
BIO
I build software that operates at the intersection of hardware and intelligence. My work spans hand-written CUDA kernels for flash attention, a multi-tenant GPU hypervisor that breaks the HBM concurrent-user ceiling, and a ring-0 OS kernel with in-kernel malware classification.
With my previous experience of building core systems for my company, I write software that is scalable and sits at the hardware layer — schedulers, memory managers, interrupt handlers, and tensor-core GEMM kernels.
RustC++/CUDAPythonTypeScriptPTX inline ASMTokio asyncNext.jsDocker
SCROLL
// SECTION_02
PROJECTS.
PROJECT_01
Nebion
GPU Hypervisor Engine
RustCUDAC++PTX
Multi-tenant LLM inference server. Runs Qwen-70B on a single H100.
- ›Hand-written CUDA kernels: GEMM, flash attention, paged KV decode
- ›OpenAI-compatible HTTP surface via Axum + Tokio
- ›4-tier adaptive quantization (FP16/FP8 E4M3/INT8/INT4) with attention-mass EMA solver
- ›3-tier KV cache fabric: HBM → Pinned Host RAM → NVMe
- ›SLA-aware admission with mid-flight P0–P3 preemption
- ›2-3× more concurrent users on Qwen-7B / H100
3×
Concurrent users
70B
Params on H100
FP8
Min quantization
PROJECT_02
Cyber-OS
AI-Kernel Security Subsystem
RustWHPXx86_64 ASMC++
Ring-0 OS kernel with in-kernel ML malware classification.
- ›Rust WHPX VMM + ~5K-line no-std x86_64 kernel
- ›Own GDT/IDT/TSS/4-level paging with NX, preemptive scheduler
- ›SYSCALL/SYSRET — full ring-0 stack, sub-µs interrupt handling at 100 Hz
- ›In-kernel 9-stump gradient-boosted malware classifier (sub-µs inference)
- ›MalConv2 CNN + LightGBM-ABI with Authenticode + reputation filtering
- ›Cut false positives 96%: 1,198 → 50 on a 22,000-file scan
96%
FP reduction
22K
Files scanned
<1µs
Interrupt latency
SCROLL
// SECTION_03
SKILLS.
// gemm_fp16.cu — scroll to explore the kernel
gemm_fp16.cunebion / crates / nebion-cuda
| 1 | #include <cuda_runtime.h> |
| 2 | #include <cuda_fp16.h> |
| 3 | #include <cublas_v2.h> |
| 4 | #include <cstddef> |
| 5 | |
| 6 | namespace nebion { |
| 7 | |
| 8 | constexpr int BM = 64; |
| 9 | constexpr int BN = 64; |
| 10 | constexpr int BK = 16; |
| 11 | constexpr int TM = 4; |
| 12 | constexpr int TN = 4; |
| 13 | |
| 14 | static_assert(BM % TM == 0, "BM must be divisible by TM"); |
| 15 | static_assert(BN % TN == 0, "BN must be divisible by TN"); |
| 16 | static_assert((BM / TM) * (BN / TN) == 256, "block must have 256 threads"); |
| 17 | |
| 18 | __global__ void gemm_f16_kernel( |
| 19 | const int M, const int N, const int K, |
| 20 | const float alpha, |
| 21 | const __half* __restrict__ A, |
| 22 | const __half* __restrict__ B, |
| 23 | const float beta, |
| 24 | __half* __restrict__ C) |
| 25 | { |
| 26 | const int block_row = blockIdx.y * BM; |
| 27 | const int block_col = blockIdx.x * BN; |
| 28 | |
| 29 | const int tx = threadIdx.x; |
| 30 | const int ty = threadIdx.y; |
| 31 | |
| 32 | const int row = block_row + ty * TM; |
| 33 | const int col = block_col + tx * TN; |
| 34 | |
| 35 | float c_reg[TM][TN] = {{0.0f}}; |
| 36 | |
| 37 | __shared__ __half A_tile[BM][BK]; |
| 38 | __shared__ __half B_tile[BK][BN]; |
| 39 | |
| 40 | const int tid = ty * (BN / TN) + tx; |
| 41 | |
| 42 | for (int k0 = 0; k0 < K; k0 += BK) { |
| 43 | #pragma unroll |
| 44 | for (int i = 0; i < 4; i++) { |
| 45 | const int linear = tid * 4 + i; |
| 46 | const int tile_r = linear / BK; |
| 47 | const int tile_c = linear % BK; |
| 48 | const int g_r = block_row + tile_r; |
| 49 | const int g_c = k0 + tile_c; |
| 50 | A_tile[tile_r][tile_c] = |
| 51 | (g_r < M && g_c < K) ? A[g_r * K + g_c] : __float2half(0.0f); |
| 52 | } |
| 53 | |
| 54 | #pragma unroll |
| 55 | for (int i = 0; i < 4; i++) { |
| 56 | const int linear = tid * 4 + i; |
| 57 | const int tile_r = linear / BN; |
| 58 | const int tile_c = linear % BN; |
| 59 | const int g_r = k0 + tile_r; |
| 60 | const int g_c = block_col + tile_c; |
| 61 | B_tile[tile_r][tile_c] = |
| 62 | (g_r < K && g_c < N) ? B[g_r * N + g_c] : __float2half(0.0f); |
| 63 | } |
| 64 | |
| 65 | __syncthreads(); |
| 66 | |
| 67 | #pragma unroll |
| 68 | for (int kk = 0; kk < BK; kk++) { |
| 69 | __half a_col[TM]; |
| 70 | __half b_row[TN]; |
| 71 | |
| 72 | #pragma unroll |
| 73 | for (int i = 0; i < TM; i++) { |
| 74 | a_col[i] = A_tile[ty * TM + i][kk]; |
| 75 | } |
| 76 | #pragma unroll |
| 77 | for (int j = 0; j < TN; j++) { |
| 78 | b_row[j] = B_tile[kk][tx * TN + j]; |
| 79 | } |
| 80 | #pragma unroll |
| 81 | for (int i = 0; i < TM; i++) { |
| 82 | const float a_f = __half2float(a_col[i]); |
| 83 | #pragma unroll |
| 84 | for (int j = 0; j < TN; j++) { |
| 85 | c_reg[i][j] += a_f * __half2float(b_row[j]); |
| 86 | } |
| 87 | } |
| 88 | } |
| 89 | |
| 90 | __syncthreads(); |
| 91 | } |
| 92 | |
| 93 | #pragma unroll |
| 94 | for (int i = 0; i < TM; i++) { |
| 95 | const int g_r = row + i; |
| 96 | if (g_r >= M) continue; |
| 97 | #pragma unroll |
| 98 | for (int j = 0; j < TN; j++) { |
| 99 | const int g_c = col + j; |
| 100 | if (g_c >= N) continue; |
| 101 | const float prev = (beta == 0.0f) ? 0.0f : __half2float(C[g_r * N + g_c]) * beta; |
| 102 | const float out = alpha * c_reg[i][j] + prev; |
| 103 | C[g_r * N + g_c] = __float2half(out); |
| 104 | } |
| 105 | } |
| 106 | } |
| 107 | |
| 108 | } // namespace nebion |
| 109 | |
| 110 | extern "C" { |
| 111 | |
| 112 | int nebion_gemm_f16( |
| 113 | int M, int N, int K, |
| 114 | float alpha, |
| 115 | const void* A, |
| 116 | const void* B, |
| 117 | float beta, |
| 118 | void* C) |
| 119 | { |
| 120 | if (M <= 0 || N <= 0 || K <= 0) return 0; |
| 121 | if (A == nullptr || B == nullptr || C == nullptr) { |
| 122 | return static_cast<int>(cudaErrorInvalidValue); |
| 123 | } |
| 124 | if (M % nebion::BM != 0 || N % nebion::BN != 0) { |
| 125 | return static_cast<int>(cudaErrorInvalidValue); |
| 126 | } |
| 127 | |
| 128 | dim3 block(nebion::BN / nebion::TN, nebion::BM / nebion::TM); |
| 129 | dim3 grid(N / nebion::BN, M / nebion::BM); |
| 130 | |
| 131 | nebion::gemm_f16_kernel<<<grid, block>>>( |
| 132 | M, N, K, alpha, |
| 133 | static_cast<const __half*>(A), |
| 134 | static_cast<const __half*>(B), |
| 135 | beta, |
| 136 | static_cast<__half*>(C) |
| 137 | ); |
| 138 | |
| 139 | cudaError_t err = cudaGetLastError(); |
| 140 | return static_cast<int>(err); |
| 141 | } |
| 142 | |
| 143 | static cublasHandle_t g_cublas_handle = nullptr; |
| 144 | |
| 145 | static cublasStatus_t ensure_handle() |
| 146 | { |
| 147 | if (g_cublas_handle == nullptr) { |
| 148 | return cublasCreate(&g_cublas_handle); |
| 149 | } |
| 150 | return CUBLAS_STATUS_SUCCESS; |
| 151 | } |
| 152 | |
| 153 | static int run_cublas_hgemm( |
| 154 | int M, int N, int K, |
| 155 | float alpha_f, |
| 156 | const void* A, |
| 157 | const void* B, |
| 158 | float beta_f, |
| 159 | void* C, |
| 160 | cublasGemmAlgo_t algo) |
| 161 | { |
| 162 | if (ensure_handle() != CUBLAS_STATUS_SUCCESS) return -1; |
| 163 | |
| 164 | const float alpha = alpha_f; |
| 165 | const float beta = beta_f; |
| 166 | |
| 167 | cublasStatus_t status = cublasGemmEx( |
| 168 | g_cublas_handle, |
| 169 | CUBLAS_OP_N, CUBLAS_OP_N, |
| 170 | N, M, K, |
| 171 | &alpha, |
| 172 | B, CUDA_R_16F, N, |
| 173 | A, CUDA_R_16F, K, |
| 174 | &beta, |
| 175 | C, CUDA_R_16F, N, |
| 176 | CUBLAS_COMPUTE_32F, |
| 177 | algo |
| 178 | ); |
| 179 | |
| 180 | if (status != CUBLAS_STATUS_SUCCESS) { |
| 181 | return static_cast<int>(status); |
| 182 | } |
| 183 | |
| 184 | cudaError_t err = cudaGetLastError(); |
| 185 | return static_cast<int>(err); |
| 186 | } |
| 187 | |
| 188 | int nebion_cublas_hgemm_tc( |
| 189 | int M, int N, int K, |
| 190 | float alpha_f, |
| 191 | const void* A, |
| 192 | const void* B, |
| 193 | float beta_f, |
| 194 | void* C) |
| 195 | { |
| 196 | if (ensure_handle() != CUBLAS_STATUS_SUCCESS) return -1; |
| 197 | |
| 198 | cublasSetMathMode(g_cublas_handle, CUBLAS_DEFAULT_MATH); |
| 199 | |
| 200 | const float alpha = alpha_f; |
| 201 | const float beta = beta_f; |
| 202 | |
| 203 | cublasStatus_t status = cublasGemmEx( |
| 204 | g_cublas_handle, |
| 205 | CUBLAS_OP_N, CUBLAS_OP_N, |
| 206 | N, M, K, |
| 207 | &alpha, |
| 208 | B, CUDA_R_16F, N, |
| 209 | A, CUDA_R_16F, K, |
| 210 | &beta, |
| 211 | C, CUDA_R_16F, N, |
| 212 | CUBLAS_COMPUTE_32F_FAST_16F, |
| 213 | CUBLAS_GEMM_DEFAULT_TENSOR_OP |
| 214 | ); |
| 215 | if (status != CUBLAS_STATUS_SUCCESS) return static_cast<int>(status); |
| 216 | |
| 217 | cudaError_t err = cudaGetLastError(); |
| 218 | return static_cast<int>(err); |
| 219 | } |
| 220 | |
| 221 | int nebion_cublas_hgemm_cuda_cores( |
| 222 | int M, int N, int K, |
| 223 | float alpha_f, |
| 224 | const void* A, |
| 225 | const void* B, |
| 226 | float beta_f, |
| 227 | void* C) |
| 228 | { |
| 229 | if (ensure_handle() != CUBLAS_STATUS_SUCCESS) return -1; |
| 230 | |
| 231 | cublasSetMathMode(g_cublas_handle, CUBLAS_PEDANTIC_MATH); |
| 232 | |
| 233 | const float alpha = alpha_f; |
| 234 | const float beta = beta_f; |
| 235 | |
| 236 | cublasStatus_t status = cublasGemmEx( |
| 237 | g_cublas_handle, |
| 238 | CUBLAS_OP_N, CUBLAS_OP_N, |
| 239 | N, M, K, |
| 240 | &alpha, |
| 241 | B, CUDA_R_16F, N, |
| 242 | A, CUDA_R_16F, K, |
| 243 | &beta, |
| 244 | C, CUDA_R_16F, N, |
| 245 | CUBLAS_COMPUTE_32F, |
| 246 | CUBLAS_GEMM_DEFAULT |
| 247 | ); |
| 248 | |
| 249 | cublasSetMathMode(g_cublas_handle, CUBLAS_DEFAULT_MATH); |
| 250 | |
| 251 | if (status != CUBLAS_STATUS_SUCCESS) return static_cast<int>(status); |
| 252 | cudaError_t err = cudaGetLastError(); |
| 253 | return static_cast<int>(err); |
| 254 | } |
| 255 | |
| 256 | } // extern "C" |
CUDA C++·UTF-8● nebion-cuda
GPU / CUDA
CUDAPTX inline ASMWMMA Tensor-CoreFlash Attention KernelsPaged KV Decode
Systems
RustC++Tokio async runtimex86_64 Ring-0WHPX VMMno-std kernel
ML / Research
PyTorchsafe-tensorsKV CompressionQuantization (FP8/INT8/INT4)LightGBM
Backend
Axum + TowerNext.js App RouterNode.jsPostgresDockerClerkREST
SCROLL
// SECTION_04
EXP_LOG.
POSITION_01
Software Engineer
J.P. Morgan Chase & Co.
August 2026 · Upcoming
UPCOMING
POSITION_02
Software Engineer
Alliance Australia Property Pvt. Ltd
May 2025 — April 2026 · Kolkata, India
COMPLETED
dmesg | grep -i "alliance"
[05/2025]INFO
Joined Alliance Australia Property Pvt. Ltd as Software Engineer[05/2025]INIT
Bootstrapped LLM pipeline for AI-based property valuation reports[06/2025]PERF
Data ingestion + automation → 60% reduction in manual report time[07/2025]KERNEL
Architected handwritten x86_64 kernel — detected 4+ critical vulnerabilities[08/2025]PERF
System security hardened by 40% via kernel-level threat isolation[09/2025]REVENUE
Software generated 100 leads — company hit $10K MRR milestone[04/2026]EXIT
Role completed — full-cycle: infra engineer → kernel author → revenue driverEDUCATION
B.Tech Computer Science
University of Engineering & Management
Kolkata, India
KEY_ACHIEVEMENTS
MRR Generated$10K
Leads via Software100+
Time Reduction60%
Security Improvement40%
LANGUAGES
Rust95%
C++/CUDA90%
Python85%
TypeScript80%
Java65%
SCROLL
// SECTION_05
CONNECT.
Building something at the intersection of hardware and intelligence?
Let's talk.
swarnavo@nebion:~ — contact.sh
$whoami
Swarnavo Mukherjee // GPU Systems · LLM Inference · CUDA
$cat location.txt
Kolkata, India · Open to remote worldwide
$cat availability.txt
Open to: Full-time · Contract · Research Collabs
$echo $EMAIL
swarnavomukherjee03@gmail.com
$
SWARNAVO.SYS v2.0.0