Wednesday, March 25, 2026

How I Fixed My Sluggish Mac in Minutes Using Kiro CLI

My MacBook had been crawling for days. Apps took forever to open, switching between windows felt like wading through mud, and I had no idea why. I only had 4 browser tabs open — nothing unusual. Then I tried Kiro CLI, and within minutes I had my answer and my fix.

Here's exactly how it went down.


The Problem

Everything was slow. Not "a little laggy" slow — genuinely unusable slow. Spinning beach balls, delayed keystrokes, the works. I'd already tried the usual suspects: restarting apps, clearing cache, the classic "turn it off and on again." Nothing helped.


Installing Kiro CLI

Getting started took less than a minute. Kiro CLI is a terminal-based AI assistant that can interact directly with your system.

brew install kiro-cli

Then just launch it:

kiro-cli chat

That's it. No complex setup, no config files to edit.


Identifying the Issue

I typed one line to Kiro:

"my system is very slow"

Kiro immediately ran a system diagnostic and surfaced this:

Load Avg: 19.14, 69.88, 67.35
PhysMem: 7489M used, 142M unused
VM: 384609609 swapins, 394054546 swapouts

Kiro's analysis was direct: a load average of 19 is dangerously high (a healthy load stays at or below your CPU core count, typically 1–4), RAM was nearly exhausted, and the system was thrashing swap: constantly reading and writing to disk, which is orders of magnitude slower than RAM.
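You can apply the same rule of thumb yourself. The sketch below parses the load line quoted above; `cores=8` is an assumed value (on a real Mac you'd read it from `sysctl -n hw.ncpu`):

```shell
# Sketch: compare the 1-minute load average against the CPU core count.
# The load line is copied from the diagnostic above; cores=8 is an assumption.
load_line="Load Avg: 19.14, 69.88, 67.35"
load1=$(echo "$load_line" | awk '{print int($3)}')  # 1-minute average, truncated
cores=8                                             # real value: sysctl -n hw.ncpu
if [ "$load1" -gt "$cores" ]; then
  echo "overloaded: 1-min load $load1 exceeds $cores cores"
fi
```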

It also spotted the likely culprit immediately: 136 Chrome Helper processes running simultaneously.

I pushed back — I only had 4 tabs open. Kiro dug deeper and found something I hadn't considered:

"There are two user accounts running Chrome — your active session and a background session via Fast User Switching. Chrome is running in both."

That was the "aha" moment. I'd switched users earlier and never logged out. The other account had Chrome fully running in the background, invisible to me.


Fixing the Root Cause

Kiro walked me through the fix step by step.

First attempt — a clean kill signal:

sudo pkill -u [other-user] "Google Chrome"

That reduced Chrome processes from 136 to 55, but didn't finish the job. Kiro checked again and found 39 processes still running under the background account. So it gave me a harder fix:

sudo kill -9 $(ps aux | grep -i "Google Chrome" | grep [other-user] | awk '{print $2}' | tr '\n' ' ')
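If you want to see what a pipeline like that would kill before actually running it, drop the `sudo kill -9` wrapper and print the PIDs first. The `ps` output below is fabricated for illustration ("otheruser" and the PIDs are placeholders), but the filtering logic is identical:

```shell
# Dry run of the PID-extraction pipeline on made-up ps output.
# "otheruser" and the PIDs are invented for illustration only.
ps_sample='otheruser   501  0.0  1.2  Google Chrome Helper
otheruser   502  0.0  0.9  Google Chrome Helper (GPU)
me          777  0.1  2.0  Google Chrome'
pids=$(echo "$ps_sample" | grep -i "Google Chrome" | grep otheruser | awk '{print $2}' | tr '\n' ' ')
echo "$pids"   # only the background user's PIDs survive the filter
```

Only once the printed list looks right do you plug it into `sudo kill -9 ...`.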

After that, Kiro ran another check:

Load Avg: 2.49
CPU idle: 83%
Chrome processes: 19 (normal — just my active session)

Done. System restored.


How Kiro CLI Saved My Day

What would have taken me an hour of Googling, Stack Overflow rabbit holes, and trial-and-error took about 10 minutes of conversation. Kiro didn't just tell me "Chrome uses a lot of memory" — it:

  • Ran live diagnostics on my system
  • Identified the non-obvious root cause (Fast User Switching + dual Chrome sessions)
  • Gave me the exact commands to fix it
  • Verified the fix actually worked after each step

It felt less like using a tool and more like having a sysadmin sitting next to me.


Try Kiro CLI for Free

Kiro offers a 500-credit free trial when you sign up — more than enough to explore what it can do. After the trial, there's a free tier to keep using it, with paid plans starting at $20/month if you need more capacity.

👉 kiro.dev

If your Mac (or any system) ever feels inexplicably slow, just open a terminal and ask. You might be surprised how fast you get an answer.




Thursday, March 5, 2026

Spark Connect vs RDD: Understanding Modern Spark Architecture

TL;DR: Spark Connect represents a shift toward remote, DataFrame-centric development, leaving behind the low-level RDD API. Here's what that means for your data pipelines.

The Evolution of Spark APIs

Apache Spark has always offered multiple levels of abstraction, but Spark Connect marks a deliberate move up the stack. Understanding these layers is crucial for modern data engineering.

Three Layers, One Engine

At the foundation sits Spark Core — the execution engine handling task scheduling, memory management, and fault tolerance. Everything else is built on top.

The RDD (Resilient Distributed Dataset) API gave developers fine-grained control with operations like map(), filter(), and reduceByKey(). It's powerful but requires manual optimization and deep Spark knowledge.

The DataFrame/SQL API provides a declarative, schema-aware interface. Think df.groupBy().count() or pure SQL queries. The Catalyst optimizer handles query planning automatically, often outperforming hand-tuned RDD code.

What Makes Spark Connect Different?

Spark Connect introduces a client-server architecture that fundamentally changes how you interact with Spark:

Traditional Spark: Your laptop runs the full Spark runtime. You have access to everything — DataFrames, RDDs, SparkContext — but need the entire Spark distribution installed locally.

Spark Connect: Your laptop runs a thin client that sends DataFrame operations to a remote Spark cluster via gRPC. Only DataFrame/SQL APIs are supported. RDDs and SparkContext? Not available.

A Real-World Example

Let's analyze web server logs to find 404 errors and top pages.

With RDDs:

logs_rdd = sc.textFile("s3://bucket/logs/*.log")
parsed = logs_rdd.map(parse_log)
errors = parsed.filter(lambda x: x['status'] == '404').count()
top_pages = parsed.map(lambda x: (x['url'], 1)) \
                  .reduceByKey(lambda a, b: a + b) \
                  .takeOrdered(10, key=lambda x: -x[1])



With Spark Connect (DataFrames):

spark = SparkSession.builder.remote("sc://cluster:15002").getOrCreate()
logs_df = spark.read.text("s3://bucket/logs/*.log")
parsed = logs_df.select(split(col("value"), " ")...)
errors = parsed.filter(col("status") == "404").count()
top_pages = parsed.groupBy("url").count() \
                  .orderBy(col("count").desc()).limit(10)



Or even simpler with SQL:

spark.sql("SELECT url, COUNT(*) FROM logs GROUP BY url ORDER BY 2 DESC LIMIT 10")



The DataFrame approach is more readable, automatically optimized, and runs remotely without a full Spark installation.

Why the Restrictions?

Spark Connect's limitations are intentional design choices:

Simpler API surface → easier to maintain and evolve

Remote-friendly → DataFrames serialize well over the network; RDD closures don't

Better practices → encourages modern, optimized patterns

Stability → client crashes don't affect server-side jobs


When You Still Need RDDs

RDDs aren't obsolete — they're just specialized. You need them for:

Custom partitioning logic (rdd.partitionBy())

Complex stateful transformations outside DataFrame capabilities

Working with truly unstructured data that doesn't fit tabular models

Fine-grained control over shuffle and execution


But here's the catch: if you need RDDs, you can't use Spark Connect.

The Bottom Line

For most data engineering workloads — ETL, analytics, aggregations — Spark Connect with DataFrames is simpler, faster, and more maintainable. The Catalyst optimizer often outperforms manually-tuned RDD code, and remote execution from notebooks is incredibly convenient.

RDDs remain available for those edge cases requiring low-level control, but the industry trend is clear: DataFrame APIs are the future of Spark development.

Decision Framework

Choose Spark Connect when:

You want remote development (Jupyter, IDEs)

Your workload fits DataFrame/SQL patterns

You value automatic optimization

You want simplified dependency management


Stick with traditional Spark when:

You need RDD-level control

Working with DynamicFrames (AWS Glue)

Custom partitioning or stateful operations

Legacy codebases that can't be refactored


The good news? Most new Spark applications can — and should — be built using DataFrames, making Spark Connect a natural fit for modern data platforms.

-------------------------

TRADITIONAL SPARK                    SPARK CONNECT
=================                    =============

┌──────────────────┐                ┌─────────────┐
│  Your Laptop     │                │ Your Laptop │
│                  │                │   (Thin)    │
│  ┌────────────┐  │                │             │
│  │ Full Spark │  │                │  ┌───────┐  │
│  │ Runtime    │  │    vs.         │  │Client │  │
│  │            │  │                │  │Library│  │
│  │ RDD + DF   │  │                │  │       │  │
│  │ APIs       │  │                │  │DF API │  │
│  └────────────┘  │                │  │ only  │  │
│                  │                │  └───┬───┘  │
│  Executes        │                │      │      │
│  Locally         │                │      │      │
└──────────────────┘                └──────┼──────┘
                                           │
                                           │ Network
                                           │ (gRPC)
                                           │
                                    ┌──────▼──────┐
                                    │Spark Cluster│
                                    │             │
                                    │  Executes   │
                                    │  Remotely   │
                                    │             │
                                    │ RDD + DF    │
                                    │ (Internal)  │
                                    └─────────────┘

Thursday, December 14, 2023

xargs

  1.  Run rm recursively in all subdirectories. Navigate to the target dir and run the command below (each subdirectory gets its own subshell, so no cd .. is needed)
    1. ls | xargs -I % sh -c 'cd "%" && pwd && rm -rf ./*'
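A safer equivalent uses find, which handles recursion and odd filenames itself. The sandbox below is purely illustrative (the directory and file names are made up):

```shell
# Build a throwaway sandbox: a parent dir with two subdirectories holding files.
tmp=$(mktemp -d)
mkdir -p "$tmp/proj1" "$tmp/proj2"
touch "$tmp/proj1/old.log" "$tmp/proj2/old.log"

# Delete everything below the first level, keeping the subdirectories themselves.
find "$tmp" -mindepth 2 -delete

find "$tmp" -mindepth 1   # lists only the two (now empty) subdirectories
```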

Wednesday, June 29, 2022

TMUX cheat sheet

Description        cmd
Scroll             ctrl+b [  (enter scroll mode; up/down keys to scroll, q to quit)
New Window         ctrl+b c
Rename Window      ctrl+b ,

Friday, May 28, 2021

Productivity Tools

  1. Form History - https://stephanmahieu.github.io/fhc-home/
  2. Shell Directory Management - https://github.com/mcwoodle/shell-directory-management/blob/master/README.md
  3. Browser Extensions
    1. https://addons.mozilla.org/en-US/firefox/addon/screenshot-capture-annotate/?utm_source=addons.mozilla.org&utm_medium=referral&utm_content=search