SYSTEM_ONLINE :: PORTFOLIO_v2.0.0

Swarnavo
Mukherjee

CUDARustLLM InferenceGPU HypervisorRing-0 Kernel

VIEW_PROJECTS CONNECT GITHUB

SCROLL

// SECTION_01

PROFILE.

IDENTITY_CONFIRMED

Swarnavo Mukherjee

Kolkata, India · techmonk000

SYS_AUTH · v2.0

● ACCESS_GRANTED

HANDLE

techmonk000

CLEARANCE

RING-0

ARCH

x86_64

PRIMARY

Rust / CUDA

THREAT_SCAN

MALWARECLEAN

INTRUSIONBLOCKED

ANOMALY0 FOUND

KERNEL_STATUS

NEBIONRUNNING

CYBER-OSARMED

CUDAsm_89

UID · 0x4E4542494F4E
SESSION · ENCRYPTED

swarnavo@nebion:~

PERFORMANCE_METRICS

False Positive Reduction (Cyber-OS)

0.0×

Concurrent Users (Nebion H100)

Report Time Reduced (Alliance)

Lines of no-std kernel code

BIO

I build software that operates at the intersection of hardware and intelligence. My work spans hand-written CUDA kernels for flash attention, a multi-tenant GPU hypervisor that breaks the HBM concurrent-user ceiling, and a ring-0 OS kernel with in-kernel malware classification.

With my previous experience of building core systems for my company, I write software that is scalable and sits at the hardware layer — schedulers, memory managers, interrupt handlers, and tensor-core GEMM kernels.

RustC++/CUDAPythonTypeScriptPTX inline ASMTokio asyncNext.jsDocker

GITHUB LINKEDIN

SCROLL

// SECTION_02

PROJECTS.

PROJECT_01

Nebion

GPU Hypervisor Engine

GITHUB

RustCUDAC++PTX

Multi-tenant LLM inference server. Runs Qwen-70B on a single H100.

›Hand-written CUDA kernels: GEMM, flash attention, paged KV decode
›OpenAI-compatible HTTP surface via Axum + Tokio
›4-tier adaptive quantization (FP16/FP8 E4M3/INT8/INT4) with attention-mass EMA solver
›3-tier KV cache fabric: HBM → Pinned Host RAM → NVMe
›SLA-aware admission with mid-flight P0–P3 preemption
›2-3× more concurrent users on Qwen-7B / H100

3×

Concurrent users

70B

Params on H100

FP8

Min quantization

PROJECT_02

Cyber-OS

AI-Kernel Security Subsystem

GITHUB

RustWHPXx86_64 ASMC++

Ring-0 OS kernel with in-kernel ML malware classification.

›Rust WHPX VMM + ~5K-line no-std x86_64 kernel
›Own GDT/IDT/TSS/4-level paging with NX, preemptive scheduler
›SYSCALL/SYSRET — full ring-0 stack, sub-µs interrupt handling at 100 Hz
›In-kernel 9-stump gradient-boosted malware classifier (sub-µs inference)
›MalConv2 CNN + LightGBM-ABI with Authenticode + reputation filtering
›Cut false positives 96%: 1,198 → 50 on a 22,000-file scan

96%

FP reduction

22K

Files scanned

<1µs

Interrupt latency

SCROLL

// SECTION_03

SKILLS.

// gemm_fp16.cu — scroll to explore the kernel

gemm_fp16.cunebion / crates / nebion-cuda

1	#include <cuda_runtime.h>
2	#include <cuda_fp16.h>
3	#include <cublas_v2.h>
4	#include <cstddef>
5
6	namespace nebion {
7
8	constexpr int BM = 64;
9	constexpr int BN = 64;
10	constexpr int BK = 16;
11	constexpr int TM = 4;
12	constexpr int TN = 4;
13
14	static_assert(BM % TM == 0, "BM must be divisible by TM");
15	static_assert(BN % TN == 0, "BN must be divisible by TN");
16	static_assert((BM / TM) * (BN / TN) == 256, "block must have 256 threads");
17
18	__global__ void gemm_f16_kernel(
19	const int M, const int N, const int K,
20	const float alpha,
21	const __half* __restrict__ A,
22	const __half* __restrict__ B,
23	const float beta,
24	__half* __restrict__ C)
25	{
26	const int block_row = blockIdx.y * BM;
27	const int block_col = blockIdx.x * BN;
28
29	const int tx = threadIdx.x;
30	const int ty = threadIdx.y;
31
32	const int row = block_row + ty * TM;
33	const int col = block_col + tx * TN;
34
35	float c_reg[TM][TN] = {{0.0f}};
36
37	__shared__ __half A_tile[BM][BK];
38	__shared__ __half B_tile[BK][BN];
39
40	const int tid = ty * (BN / TN) + tx;
41
42	for (int k0 = 0; k0 < K; k0 += BK) {
43	#pragma unroll
44	for (int i = 0; i < 4; i++) {
45	const int linear = tid * 4 + i;
46	const int tile_r = linear / BK;
47	const int tile_c = linear % BK;
48	const int g_r = block_row + tile_r;
49	const int g_c = k0 + tile_c;
50	A_tile[tile_r][tile_c] =
51	(g_r < M && g_c < K) ? A[g_r * K + g_c] : __float2half(0.0f);
52	}
53
54	#pragma unroll
55	for (int i = 0; i < 4; i++) {
56	const int linear = tid * 4 + i;
57	const int tile_r = linear / BN;
58	const int tile_c = linear % BN;
59	const int g_r = k0 + tile_r;
60	const int g_c = block_col + tile_c;
61	B_tile[tile_r][tile_c] =
62	(g_r < K && g_c < N) ? B[g_r * N + g_c] : __float2half(0.0f);
63	}
64
65	__syncthreads();
66
67	#pragma unroll
68	for (int kk = 0; kk < BK; kk++) {
69	__half a_col[TM];
70	__half b_row[TN];
71
72	#pragma unroll
73	for (int i = 0; i < TM; i++) {
74	a_col[i] = A_tile[ty * TM + i][kk];
75	}
76	#pragma unroll
77	for (int j = 0; j < TN; j++) {
78	b_row[j] = B_tile[kk][tx * TN + j];
79	}
80	#pragma unroll
81	for (int i = 0; i < TM; i++) {
82	const float a_f = __half2float(a_col[i]);
83	#pragma unroll
84	for (int j = 0; j < TN; j++) {
85	c_reg[i][j] += a_f * __half2float(b_row[j]);
86	}
87	}
88	}
89
90	__syncthreads();
91	}
92
93	#pragma unroll
94	for (int i = 0; i < TM; i++) {
95	const int g_r = row + i;
96	if (g_r >= M) continue;
97	#pragma unroll
98	for (int j = 0; j < TN; j++) {
99	const int g_c = col + j;
100	if (g_c >= N) continue;
101	const float prev = (beta == 0.0f) ? 0.0f : __half2float(C[g_r * N + g_c]) * beta;
102	const float out = alpha * c_reg[i][j] + prev;
103	C[g_r * N + g_c] = __float2half(out);
104	}
105	}
106	}
107
108	} // namespace nebion
109
110	extern "C" {
111
112	int nebion_gemm_f16(
113	int M, int N, int K,
114	float alpha,
115	const void* A,
116	const void* B,
117	float beta,
118	void* C)
119	{
120	if (M <= 0 \|\| N <= 0 \|\| K <= 0) return 0;
121	if (A == nullptr \|\| B == nullptr \|\| C == nullptr) {
122	return static_cast<int>(cudaErrorInvalidValue);
123	}
124	if (M % nebion::BM != 0 \|\| N % nebion::BN != 0) {
125	return static_cast<int>(cudaErrorInvalidValue);
126	}
127
128	dim3 block(nebion::BN / nebion::TN, nebion::BM / nebion::TM);
129	dim3 grid(N / nebion::BN, M / nebion::BM);
130
131	nebion::gemm_f16_kernel<<<grid, block>>>(
132	M, N, K, alpha,
133	static_cast<const __half*>(A),
134	static_cast<const __half*>(B),
135	beta,
136	static_cast<__half*>(C)
137	);
138
139	cudaError_t err = cudaGetLastError();
140	return static_cast<int>(err);
141	}
142
143	static cublasHandle_t g_cublas_handle = nullptr;
144
145	static cublasStatus_t ensure_handle()
146	{
147	if (g_cublas_handle == nullptr) {
148	return cublasCreate(&g_cublas_handle);
149	}
150	return CUBLAS_STATUS_SUCCESS;
151	}
152
153	static int run_cublas_hgemm(
154	int M, int N, int K,
155	float alpha_f,
156	const void* A,
157	const void* B,
158	float beta_f,
159	void* C,
160	cublasGemmAlgo_t algo)
161	{
162	if (ensure_handle() != CUBLAS_STATUS_SUCCESS) return -1;
163
164	const float alpha = alpha_f;
165	const float beta = beta_f;
166
167	cublasStatus_t status = cublasGemmEx(
168	g_cublas_handle,
169	CUBLAS_OP_N, CUBLAS_OP_N,
170	N, M, K,
171	&alpha,
172	B, CUDA_R_16F, N,
173	A, CUDA_R_16F, K,
174	&beta,
175	C, CUDA_R_16F, N,
176	CUBLAS_COMPUTE_32F,
177	algo
178	);
179
180	if (status != CUBLAS_STATUS_SUCCESS) {
181	return static_cast<int>(status);
182	}
183
184	cudaError_t err = cudaGetLastError();
185	return static_cast<int>(err);
186	}
187
188	int nebion_cublas_hgemm_tc(
189	int M, int N, int K,
190	float alpha_f,
191	const void* A,
192	const void* B,
193	float beta_f,
194	void* C)
195	{
196	if (ensure_handle() != CUBLAS_STATUS_SUCCESS) return -1;
197
198	cublasSetMathMode(g_cublas_handle, CUBLAS_DEFAULT_MATH);
199
200	const float alpha = alpha_f;
201	const float beta = beta_f;
202
203	cublasStatus_t status = cublasGemmEx(
204	g_cublas_handle,
205	CUBLAS_OP_N, CUBLAS_OP_N,
206	N, M, K,
207	&alpha,
208	B, CUDA_R_16F, N,
209	A, CUDA_R_16F, K,
210	&beta,
211	C, CUDA_R_16F, N,
212	CUBLAS_COMPUTE_32F_FAST_16F,
213	CUBLAS_GEMM_DEFAULT_TENSOR_OP
214	);
215	if (status != CUBLAS_STATUS_SUCCESS) return static_cast<int>(status);
216
217	cudaError_t err = cudaGetLastError();
218	return static_cast<int>(err);
219	}
220
221	int nebion_cublas_hgemm_cuda_cores(
222	int M, int N, int K,
223	float alpha_f,
224	const void* A,
225	const void* B,
226	float beta_f,
227	void* C)
228	{
229	if (ensure_handle() != CUBLAS_STATUS_SUCCESS) return -1;
230
231	cublasSetMathMode(g_cublas_handle, CUBLAS_PEDANTIC_MATH);
232
233	const float alpha = alpha_f;
234	const float beta = beta_f;
235
236	cublasStatus_t status = cublasGemmEx(
237	g_cublas_handle,
238	CUBLAS_OP_N, CUBLAS_OP_N,
239	N, M, K,
240	&alpha,
241	B, CUDA_R_16F, N,
242	A, CUDA_R_16F, K,
243	&beta,
244	C, CUDA_R_16F, N,
245	CUBLAS_COMPUTE_32F,
246	CUBLAS_GEMM_DEFAULT
247	);
248
249	cublasSetMathMode(g_cublas_handle, CUBLAS_DEFAULT_MATH);
250
251	if (status != CUBLAS_STATUS_SUCCESS) return static_cast<int>(status);
252	cudaError_t err = cudaGetLastError();
253	return static_cast<int>(err);
254	}
255
256	} // extern "C"

CUDA C++·UTF-8● nebion-cuda

GPU / CUDA

CUDAPTX inline ASMWMMA Tensor-CoreFlash Attention KernelsPaged KV Decode

Systems

RustC++Tokio async runtimex86_64 Ring-0WHPX VMMno-std kernel

ML / Research

PyTorchsafe-tensorsKV CompressionQuantization (FP8/INT8/INT4)LightGBM

Backend

Axum + TowerNext.js App RouterNode.jsPostgresDockerClerkREST

SCROLL

// SECTION_04

EXP_LOG.

POSITION_02

Software Engineer

Alliance Australia Property Pvt. Ltd

May 2025 — April 2026 · Kolkata, India

COMPLETED

dmesg | grep -i "alliance"

[05/2025]INFOCAREER

Joined Alliance Australia Property Pvt. Ltd as Software Engineer

[05/2025]INITSYSTEM

Bootstrapped LLM pipeline for AI-based property valuation reports

[06/2025]PERFPIPELINE

Data ingestion + automation → 60% reduction in manual report time

[07/2025]KERNELSEC_SUBSYS

Architected handwritten x86_64 kernel — detected 4+ critical vulnerabilities

[08/2025]PERFSEC_SUBSYS

System security hardened by 40% via kernel-level threat isolation

[09/2025]REVENUEMETRICS

Software generated 100 leads — company hit $10K MRR milestone

[04/2026]EXITCAREER

Role completed — full-cycle: infra engineer → kernel author → revenue driver

EDUCATION

B.Tech Computer Science

University of Engineering & Management

Kolkata, India

KEY_ACHIEVEMENTS

MRR Generated$10K

Leads via Software100+

Time Reduction60%

Security Improvement40%

LANGUAGES

Rust95%

C++/CUDA90%

Python85%

TypeScript80%

Java65%

SCROLL

// SECTION_05

CONNECT.

Building something at the intersection of hardware and intelligence?
Let's talk.

swarnavo@nebion:~ — contact.sh

$whoami

Swarnavo Mukherjee // GPU Systems · LLM Inference · CUDA

$cat location.txt

Kolkata, India · Open to remote worldwide

$cat availability.txt

Open to: Full-time · Contract · Research Collabs

$echo $EMAIL

swarnavomukherjee03@gmail.com

EMAILswarnavomukherjee03@gmail.com GITHUBtechmonk000 LINKEDINswarnavo-mukherjee

SWARNAVO.SYS v2.0.0

SwarnavoMukherjee

PROFILE.

PROJECTS.

Nebion

Cyber-OS

SKILLS.

EXP_LOG.

Software Engineer

CONNECT.

Swarnavo
Mukherjee