Why AI Coding Tools Forced Me to Start Testing Again

In early 2022, I made a quiet decision: I stopped writing unit tests for most of the WordPress and Node projects coming through Seahawk. Not loudly. No blog post about it. I just... stopped. It seemed justified: we were delivering 15 to 20 client sites a month, I had three other developers in rotation, and nobody read the tests I was writing. Manual QA caught the real bugs. The tests were just for show.

Fast forward to late 2023. GitHub Copilot had been in my editor for about eight months. I had also started using Cursor on the side for anything greenfield. The speed was genuinely remarkable. But something started happening. Bugs were appearing in places I hadn't touched. Logic that looked correct was wrong in edge cases I'd never thought to check. And the worst part: the AI had no idea it was wrong. It wrote the broken code with the same confident indentation it always does.

That's when I picked tests back up.

---

The Time I Gave Up on Testing (And Why It Made Sense at the Time)

Honestly? For a certain kind of project, skipping tests was the right call. If you're building a five-page brochure site in WordPress, writing PHPUnit tests for a contact form plugin is theatre. I stand by that.

For a long time, this kind of work was Seahawk's bread and butter: high-volume, relatively low-complexity, well-defined scope. A client hands you a Figma file, you build it, QA it, and launch it. Feedback loops were short. If something broke, we knew within hours. Writing tests in that context was like laminating a Post-it note.

But I generalized that lesson far too aggressively. I started treating all projects like brochure sites. Even the ones with custom WooCommerce checkout flows. Even the fintech dashboard we built in early 2023 for a client in Frankfurt: full custom REST API, JWT auth, three different user permission tiers. No tests. Just "careful manual QA." That was arrogant, and it bit us.

The Frankfurt project shipped with a permissions bug that let editor-level users query admin-level data under a specific combination of filters. We found out when their internal team ran a security review six weeks after launch. Embarrassing. Fixable. But it is exactly the kind of thing a basic integration test would have flagged before the pull request was even raised.
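A minimal sketch of the kind of check that would have caught a filter-dependent permission leak. Everything here is hypothetical: the roles, the record shape, and the `queryRecords` and `findLeaks` names are invented for illustration and are not the Frankfurt project's actual code. It is written without a test framework so it runs anywhere; in Vitest the final checks would sit inside `test(...)` blocks.

```typescript
// Hypothetical roles and record shape, invented for this example.
type Role = "viewer" | "editor" | "admin";

interface User { id: number; role: Role; }
interface LedgerRecord { id: number; minRole: Role; region: string; status: string; }
// null means "no filter on this field".
interface Filters { region: string | null; status: string | null; }

const ROLE_RANK: Record<Role, number> = { viewer: 0, editor: 1, admin: 2 };

// The permission check runs before any filter, so no filter
// combination can widen what a user is allowed to see.
function queryRecords(user: User, records: LedgerRecord[], filters: Filters): LedgerRecord[] {
  return records
    .filter((r) => ROLE_RANK[user.role] >= ROLE_RANK[r.minRole])
    .filter((r) => filters.region === null || r.region === filters.region)
    .filter((r) => filters.status === null || r.status === filters.status);
}

function unique(values: string[]): string[] {
  return values.filter((v, i) => values.indexOf(v) === i);
}

// Try every filter combination for a user and collect the ones
// that return a record above that user's role.
function findLeaks(user: User, records: LedgerRecord[]): Filters[] {
  const regions: (string | null)[] = [null, ...unique(records.map((r) => r.region))];
  const statuses: (string | null)[] = [null, ...unique(records.map((r) => r.status))];
  const leaks: Filters[] = [];
  for (const region of regions) {
    for (const status of statuses) {
      const visible = queryRecords(user, records, { region, status });
      if (visible.some((r) => ROLE_RANK[r.minRole] > ROLE_RANK[user.role])) {
        leaks.push({ region, status });
      }
    }
  }
  return leaks;
}

const sampleRecords: LedgerRecord[] = [
  { id: 1, minRole: "viewer", region: "EU", status: "open" },
  { id: 2, minRole: "editor", region: "EU", status: "closed" },
  { id: 3, minRole: "admin", region: "EU", status: "open" },
  { id: 4, minRole: "admin", region: "US", status: "closed" },
];
const editorUser: User = { id: 7, role: "editor" };

const editorLeaks = findLeaks(editorUser, sampleRecords);
if (editorLeaks.length > 0) {
  throw new Error(`editor can see admin data with filters: ${JSON.stringify(editorLeaks)}`);
}
```

The point of looping over every combination is that the bug described above only appeared under one specific mix of filters, which is exactly what a hand-clicked QA pass tends to miss.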

---

What AI Coding Tools Actually Changed

Here's the thing most people miss when they talk about Copilot or Cursor or whichever model is hot this month: the code looks right. That's the problem.

When a junior developer writes buggy code, you can often see the uncertainty in it. Odd variable names, a comment that says // not sure about this, a function that's clearly copy-pasted twice. The code telegraphs its own fragility. AI code doesn't. It's stylistically consistent, well-named, and structured in a way that reads as intentional. The confidence is entirely cosmetic.

Studies from Stanford's Human-Computer Interaction group have flagged that developers using AI assistants tend to over-trust generated code on first read. That tracks with my own experience. I would glance at a 40-line function Copilot had written, think "yeah, that's basically what I'd have written," and move on. Sometimes it was fine. Sometimes it had silently misunderstood what I actually needed.

The specific failure mode I kept hitting: conditional logic around edge cases the AI had no reason to anticipate. It would write a function that handled the happy path perfectly and then quietly fail on null inputs, empty arrays, or non-standard date formats. Things that would have taken me thirty seconds to think about if I'd written the code myself, because I'd have been thinking as I typed.
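To make those three edge cases concrete, here is a small sketch with two hypothetical helpers, `averageOrderValue` and `parseOrderDate`. Neither comes from a real project, and the choices they make (returning 0 for an empty list, accepting only YYYY-MM-DD and DD/MM/YYYY) are assumptions for the example, not recommendations.

```typescript
// Hypothetical helper: the happy-path version is just `sum / orders.length`,
// which throws on null and yields NaN on an empty array.
function averageOrderValue(orders: number[] | null | undefined): number {
  if (!orders || orders.length === 0) return 0; // null input and empty array
  const sum = orders.reduce((total, value) => total + value, 0);
  return sum / orders.length;
}

// Hypothetical helper: accepts YYYY-MM-DD and DD/MM/YYYY and rejects
// everything else, rather than letting `new Date(string)` guess.
function parseOrderDate(input: string | null | undefined): Date | null {
  if (!input) return null;
  const text = input.trim();
  const iso = /^(\d{4})-(\d{2})-(\d{2})$/.exec(text);
  const dmy = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(text);
  const parts = iso ? [iso[1], iso[2], iso[3]] : dmy ? [dmy[3], dmy[2], dmy[1]] : null;
  if (parts === null) return null;

  const year = Number(parts[0]);
  const month = Number(parts[1]);
  const day = Number(parts[2]);
  const date = new Date(Date.UTC(year, month - 1, day));

  // Date silently rolls impossible dates over (31 Feb becomes early March),
  // so confirm the components survived the round trip.
  const valid =
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day;
  return valid ? date : null;
}
```

Each guard here corresponds to one line of a test file: a null input, an empty array, an unexpected date format, and an impossible date.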

The Speed Trap

There's a real productivity trap here. AI makes you fast. Being fast feels good. You start shipping faster and reviewing less carefully because velocity feels like evidence of quality. It isn't. When you're prompting a language model, speed and correctness are not correlated.

In September I shipped roughly 40% more features on a client project than I could have managed without AI assistance. But that project produced more post-launch bugs than anything I had shipped in the previous two years. None of them catastrophic. Annoying ones. The kind that erode client trust.

---

Tests Work Differently Now (With AI in the Loop)

When I came back to testing, I didn't come back to the old workflow. Tests first, then implementation, then an AI-assisted code review: that's the loop I've settled on.

The interesting thing is that AI is actually excellent at writing tests, in a way it isn't always excellent at writing application logic. Give Copilot a well-defined function signature and ask it to generate a test suite and it'll produce edge-case coverage I'd have taken twenty minutes to write manually. It imagines unhappy paths well when the task is specifically "find ways this can break."

So I've flipped the thing around. I write the test spec. AI fills in the test cases. Then AI writes the implementation. Then I read the implementation through the lens of those tests, rather than just reading the code cold.

It's slower than pure vibe-coding. But it's faster than the old workflow of writing everything by hand, tests included. And I have shipped zero permissions bugs since Frankfurt.

The Tools I'm Actually Using

[Vitest](https://vitest.dev) for anything JavaScript or TypeScript. Replaced Jest for me entirely last year: the config is saner and the watch mode is quick.
PHPUnit still, for WordPress and custom PHP work. Nothing has replaced it.
Cursor's "test this function" shortcut, genuinely one of the most useful single features in any editor I've used.
GitHub Actions for CI. Tests run on every push to main. Takes about 90 seconds on most projects.

---

The Case Against Tests (Steel-Manned)

I want to give this position a fair hearing because I held it for nearly two years.

The real argument isn't "tests are useless." It's "tests have a cost, and plenty of projects don't justify that cost." Writing and maintaining a test suite takes time. On a project with a short lifespan (a campaign microsite, a marketing landing page, a hackathon prototype) that time investment never pays back. The project will be dead before the tests save anything.

And there's a subtler point: bad tests are worse than no tests. A test suite that passes because the tests are tautological (you're essentially testing that your function returns what you told it to return) gives you false confidence. I've seen this at agencies. Developers writing tests that always pass because nobody challenged what they were actually verifying.
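Here is one way to picture a tautological test next to a meaningful one. The VAT functions and the figures are made up for the example; the only thing being illustrated is that the first check derives its expected value from the code under test, so it passes even when that code is wrong.

```typescript
type VatFn = (net: number, ratePercent: number) => number;

// Hypothetical correct implementation: gross = net + VAT, rounded to cents.
const correctVat: VatFn = (net, ratePercent) =>
  Math.round(net * (1 + ratePercent / 100) * 100) / 100;

// Hypothetical buggy implementation: returns only the VAT portion.
const buggyVat: VatFn = (net, ratePercent) =>
  Math.round(net * (ratePercent / 100) * 100) / 100;

// Tautological: the expected value comes from the function itself,
// so this can never fail, whatever the function does.
function tautologicalTest(fn: VatFn): boolean {
  const expected = fn(100, 19);
  return fn(100, 19) === expected;
}

// Meaningful: the expected values are worked out by hand from the rule.
function meaningfulTest(fn: VatFn): boolean {
  return fn(100, 19) === 119 && fn(0, 19) === 0 && fn(19.99, 7) === 21.39;
}

const vatReport = {
  tautologicalPassesOnBuggy: tautologicalTest(buggyVat),
  meaningfulPassesOnBuggy: meaningfulTest(buggyVat),
  meaningfulPassesOnCorrect: meaningfulTest(correctVat),
};
```

In this sketch the tautological check passes for both implementations, while the hand-computed one separates them. Real tautological tests are usually subtler (a mock that returns the asserted value, for instance), but the shape is the same.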

Martin Fowler has written well about this: coverage percentages are not a measure of test quality. A 90% coverage number can mask a completely hollow suite.

So: don't test everything. Don't test because it feels professional. Test because you've identified logic that is load-bearing and expensive to break.

---

What I Test Now (And What I Don't)

Here's the actual decision I've landed on after the last eight or nine months:

I test:

Any function that handles money, permissions, or data transformation
Any API endpoint that isn't a straight CRUD pass-through
Custom business logic where the client has specified exact behaviour in writing
Anything AI wrote that I haven't fully read line by line

I don't test:

UI rendering (snapshot tests have not saved me once in nine years. Not once.)
Third-party API wrappers where the external behaviour is outside my control
One-off scripts that run once and then get deleted
Standard WordPress hooks, unless they're doing something unusual

That's it. No grand philosophy. Just a list based on where I've been burned.

---

The Workflow That Actually Works For Me

Since a few people have asked me in the Slack communities I'm in, here's the actual sequence:

1. Write a brief spec comment at the top of the file: what this module does, what it doesn't do, and the edge cases I already know about.
2. Ask Cursor to generate test cases from that comment before writing any implementation.
3. Review those test cases. Delete the useless ones. Add the ones the AI missed.
4. Let Copilot or Cursor write the implementation.
5. Run the tests. They will fail. Fix the implementation (not the tests).
6. Read the diff before pushing. AI-assisted code still needs a human check.

Step 6 is non-negotiable. In the last four months I've caught three genuinely bad bugs before pushing, just by reading the diff slowly. Nothing clever. Just reading.
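The sequence above might look roughly like this in a single file. The `slugify` function is a stand-in chosen for brevity, not something from a client project, and the cases are written as a plain table so the sketch runs without a test framework; with Vitest the same table would feed `test.each`.

```typescript
/**
 * SPEC (step 1): slugify(title)
 * - Lowercases and replaces each run of non-alphanumeric characters with one hyphen.
 * - Does NOT transliterate non-ASCII characters; they are treated as separators.
 * - Known edge cases: empty string, punctuation only, leading or trailing separators.
 */

// Steps 2 and 3: cases generated from the spec, then pruned and extended by hand.
const slugCases: Array<[string, string]> = [
  ["Hello World", "hello-world"],
  ["  Trim me  ", "trim-me"],
  ["Already-slugged", "already-slugged"],
  ["", ""],
  ["!!!", ""],
  ["--Leading and trailing--", "leading-and-trailing"],
];

// Step 4: the implementation, written only after the cases exist.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse separators into single hyphens
    .replace(/^-+|-+$/g, ""); // strip hyphens left at either end
}

// Step 5: run every case and list the ones that fail.
function runSlugCases(): string[] {
  return slugCases
    .filter(([input, expected]) => slugify(input) !== expected)
    .map(
      ([input, expected]) =>
        `slugify(${JSON.stringify(input)}) should be ${JSON.stringify(expected)}`
    );
}

const slugFailures = runSlugCases();
if (slugFailures.length > 0) {
  throw new Error(slugFailures.join("\n"));
}
```

Keeping the spec comment, the cases, and the implementation in that order also makes the step 6 diff easier to read, because the expectations sit right above the code they constrain.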

Kent Beck's original framing of TDD was never about 100% coverage or perfect methodology. It was about building a feedback loop fast enough to catch mistakes before they compound. That idea, fast feedback loops, is more relevant now than it was in 2003. Because the AI makes mistakes faster than any developer I've ever hired.

---

FAQ

Does this slow down your delivery speed?

On complex projects, by about 10 to 15%. On simple ones, probably not at all: AI generates the tests so quickly that the overhead is minimal. On projects where fixing a bug after launch costs real money (and most real-money projects fall into this group), that 15% is worth it a hundred times over.

What about TypeScript? Doesn't strong typing replace a lot of tests?

Partially. TypeScript catches a whole category of errors at compile time that you used to need tests for. But types don't test business logic. They don't verify that your discount calculation function is applying the right rules for wholesale customers. That's still on you.
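A small sketch of that gap, using an invented rule: wholesale customers get 15% off orders of 1000.00 or more, retail customers get nothing. The rule, the threshold, and both function names are assumptions for the example. The two functions below have identical signatures and both compile; only a check on actual values tells them apart.

```typescript
type CustomerType = "retail" | "wholesale";

// Hypothetical rule for illustration: 15% off for wholesale orders
// of at least 100000 cents. Amounts are integer cents to avoid float drift.
const WHOLESALE_THRESHOLD_CENTS = 100000;
const WHOLESALE_DISCOUNT_PERCENT = 15;

function discountedTotalCents(totalCents: number, customer: CustomerType): number {
  if (customer === "wholesale" && totalCents >= WHOLESALE_THRESHOLD_CENTS) {
    return totalCents - Math.round((totalCents * WHOLESALE_DISCOUNT_PERCENT) / 100);
  }
  return totalCents;
}

// Same types, same shape, wrong rule: the condition is inverted,
// so retail customers get the wholesale discount instead.
function discountedTotalCentsBuggy(totalCents: number, customer: CustomerType): number {
  if (customer !== "wholesale" && totalCents >= WHOLESALE_THRESHOLD_CENTS) {
    return totalCents - Math.round((totalCents * WHOLESALE_DISCOUNT_PERCENT) / 100);
  }
  return totalCents;
}

// The compiler accepts both functions. These value checks accept only the first.
function wholesaleRuleHolds(
  fn: (totalCents: number, customer: CustomerType) => number
): boolean {
  return (
    fn(100000, "wholesale") === 85000 && // at the threshold: discounted
    fn(99999, "wholesale") === 99999 && // just below: full price
    fn(100000, "retail") === 100000 // retail: never discounted
  );
}
```

The union type does earn its keep here, since it rules out a misspelled customer type at compile time. It just has nothing to say about which branch the discount belongs in.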

Should junior developers use AI coding tools if they aren't writing tests?

No. Strong opinion. A junior developer using Copilot without tests is basically flying a plane on autopilot without understanding how the autopilot works or how to land manually. The AI will produce senior-looking code, the junior won't know which parts to be suspicious of, and you'll eventually get a production incident. Tests at least give them a mechanism to verify the output they're accepting.

Why did you stop testing in the first place, honestly?

Burnout, partly. And a period where every project was genuinely simple and tests genuinely weren't pulling their weight. The mistake was not noticing when project complexity changed and adjusting accordingly. That's the real lesson: not "always test" or "never test," but knowing which category a given project falls into.

---

Writing tests didn't use to feel like protection. It felt like paperwork. AI changed that. Not because AI is bad (it has made me meaningfully faster) but because it introduced a new category of confident, well-formatted, plausible-looking mistakes that I can't catch by reading code the way I used to. Tests aren't for the AI. They're for me. A forcing function for thinking about what I actually need before I accept the code.

I wish I'd framed it that way two years ago.
