<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[MakeWithData]]></title><description><![CDATA[MakeWithData is where I share content to all data + AI practitioners. Whether you come from a data analyst, data engineer, data scientist, or business background, you'll find content that resonates with modern issues and new trends in the industry.]]></description><link>https://www.makewithdata.tech</link><image><url>https://substackcdn.com/image/fetch/$s_!hQXU!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70751bc-402f-4ebe-bd35-7cd5e8239d0c_793x793.png</url><title>MakeWithData</title><link>https://www.makewithdata.tech</link></image><generator>Substack</generator><lastBuildDate>Wed, 13 May 2026 10:30:03 GMT</lastBuildDate><atom:link href="https://www.makewithdata.tech/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Zach King]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[makewithdata@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[makewithdata@substack.com]]></itunes:email><itunes:name><![CDATA[Zach King]]></itunes:name></itunes:owner><itunes:author><![CDATA[Zach King]]></itunes:author><googleplay:owner><![CDATA[makewithdata@substack.com]]></googleplay:owner><googleplay:email><![CDATA[makewithdata@substack.com]]></googleplay:email><googleplay:author><![CDATA[Zach King]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[AWS Networking Demystified - Part 1]]></title><description><![CDATA[Explaining AWS networking with simple home analogies, and 
pictures!]]></description><link>https://www.makewithdata.tech/p/aws-networking-demystified-part-1</link><guid isPermaLink="false">https://www.makewithdata.tech/p/aws-networking-demystified-part-1</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Tue, 05 Aug 2025 17:02:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!BvvB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36484a90-2e76-43f3-89ca-e7d8693565e1_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BvvB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36484a90-2e76-43f3-89ca-e7d8693565e1_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BvvB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36484a90-2e76-43f3-89ca-e7d8693565e1_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!BvvB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36484a90-2e76-43f3-89ca-e7d8693565e1_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!BvvB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36484a90-2e76-43f3-89ca-e7d8693565e1_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!BvvB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36484a90-2e76-43f3-89ca-e7d8693565e1_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!BvvB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36484a90-2e76-43f3-89ca-e7d8693565e1_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36484a90-2e76-43f3-89ca-e7d8693565e1_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2531054,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.makewithdata.tech/i/167553051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36484a90-2e76-43f3-89ca-e7d8693565e1_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BvvB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36484a90-2e76-43f3-89ca-e7d8693565e1_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!BvvB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36484a90-2e76-43f3-89ca-e7d8693565e1_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!BvvB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36484a90-2e76-43f3-89ca-e7d8693565e1_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!BvvB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36484a90-2e76-43f3-89ca-e7d8693565e1_1536x1024.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>Introduction</h2><p>Cloud networking can feel like an alphabet soup with all the fancy words and acronyms&#8212;VPC, NATs, firewalls, subnets&#8212;oh my!</p><p>Fear not, today we&#8217;re going to demystify the basics of cloud networking by explaining it as if it&#8217;s your own home network! I&#8217;ve found myself using this approach to teach AWS networking many times, and the relief I hear when it &#8220;clicks&#8221; for someone is always a joy. 
This post is geared towards beginners, so if you&#8217;re someone who already knows what a route table is and how to manage PrivateLink connections, please subscribe for future content and save yourself the reading.</p><p>Although I&#8217;m focusing on AWS, the concepts are essentially the same in Azure or GCP, though the names of services may vary.</p><h2>What is a VPC?</h2><p>Let&#8217;s start with the Virtual Private Cloud, or VPC if you like acronyms. My Azure pals will know this as a Virtual Network, or VNet, and I have to give credit to Microsoft for the better name here. Why? </p><p>Because the VPC is just a network in the cloud. But forget about the cloud for a sec: what is a network? You&#8217;re on one right now&#8212;the one you used to access this web page and read this article. Most of us have at least one network at home, with WiFi.</p><p>WiFi is just the way that you connect wirelessly to the network, though; it isn&#8217;t the network itself. Alternatively, you could join your home network by physically connecting via Ethernet cable, or in some other ways we won&#8217;t get into.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7iAR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1090ee7-2b36-466d-81ee-02747b92eda0_1852x1132.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7iAR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1090ee7-2b36-466d-81ee-02747b92eda0_1852x1132.png 424w, https://substackcdn.com/image/fetch/$s_!7iAR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1090ee7-2b36-466d-81ee-02747b92eda0_1852x1132.png 848w, 
https://substackcdn.com/image/fetch/$s_!7iAR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1090ee7-2b36-466d-81ee-02747b92eda0_1852x1132.png 1272w, https://substackcdn.com/image/fetch/$s_!7iAR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1090ee7-2b36-466d-81ee-02747b92eda0_1852x1132.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7iAR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1090ee7-2b36-466d-81ee-02747b92eda0_1852x1132.png" width="1456" height="890" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1090ee7-2b36-466d-81ee-02747b92eda0_1852x1132.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:890,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:157721,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.makewithdata.tech/i/167553051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1090ee7-2b36-466d-81ee-02747b92eda0_1852x1132.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7iAR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1090ee7-2b36-466d-81ee-02747b92eda0_1852x1132.png 424w, 
https://substackcdn.com/image/fetch/$s_!7iAR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1090ee7-2b36-466d-81ee-02747b92eda0_1852x1132.png 848w, https://substackcdn.com/image/fetch/$s_!7iAR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1090ee7-2b36-466d-81ee-02747b92eda0_1852x1132.png 1272w, https://substackcdn.com/image/fetch/$s_!7iAR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1090ee7-2b36-466d-81ee-02747b92eda0_1852x1132.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>VPC IP Addresses</h3><p>Networks have a range of IP addresses, aka a CIDR block, such as 192.168.1.0/24 (the 256 addresses from 192.168.1.0 through 192.168.1.255). When a device connects to the network, it gets an IP address assigned, such as 192.168.1.30. A network like this is isolated from other networks, so its IP addresses are considered &#8220;private&#8221;; such a network is also known as a Local Area Network (LAN). </p><p>In fact, two different networks could have the same IP address ranges, which is fine, because they don&#8217;t communicate with each other. For example, many consumer routers, quite possibly including the one you have at home right now, use the very same CIDR I shared above, 192.168.1.0/24, as this is just a very common private IP range.</p><p>So think of a VPC as another network, like the one you have at home. Or to make it more fun, flip it: think of your home network as the VPC going forward.</p><h2>Subnets</h2><p>Next is subnets. If you have an IT background or learned networking before cloud computing was all the rage, you&#8217;ll recognize these as close cousins of VLANs.</p><p>Subnets simply divide your VPC&#8212;or network&#8212;into smaller chunks. Subnets have two primary uses:</p><ol><li><p>Organize parts of the network for logical purposes</p></li><li><p>Distinguish which parts of the network are public vs. private</p></li></ol><p>Again, all of this can be done at home as well, using VLANs you would typically configure in your home gateway or router.</p><h3>Logical Separation</h3><p>Organizing chunks of the network for different logical purposes is useful for maintainability, security, and scalability.</p><p>For example, say you start using a ton of IoT devices like smart bulbs and plugs at home. IoT devices are notorious for carrying more security risk, so we&#8217;d ideally want to segregate those to their own subnet/VLAN so they can&#8217;t reach our important trusted devices like mobile phones, laptops, and desktops. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hyyf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44320257-0eb9-4384-9570-e2786bbb1509_1782x1436.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hyyf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44320257-0eb9-4384-9570-e2786bbb1509_1782x1436.png 424w, https://substackcdn.com/image/fetch/$s_!hyyf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44320257-0eb9-4384-9570-e2786bbb1509_1782x1436.png 848w, https://substackcdn.com/image/fetch/$s_!hyyf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44320257-0eb9-4384-9570-e2786bbb1509_1782x1436.png 1272w, https://substackcdn.com/image/fetch/$s_!hyyf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44320257-0eb9-4384-9570-e2786bbb1509_1782x1436.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hyyf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44320257-0eb9-4384-9570-e2786bbb1509_1782x1436.png" width="1456" height="1173" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44320257-0eb9-4384-9570-e2786bbb1509_1782x1436.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1173,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:351730,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.makewithdata.tech/i/167553051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44320257-0eb9-4384-9570-e2786bbb1509_1782x1436.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hyyf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44320257-0eb9-4384-9570-e2786bbb1509_1782x1436.png 424w, https://substackcdn.com/image/fetch/$s_!hyyf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44320257-0eb9-4384-9570-e2786bbb1509_1782x1436.png 848w, https://substackcdn.com/image/fetch/$s_!hyyf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44320257-0eb9-4384-9570-e2786bbb1509_1782x1436.png 1272w, https://substackcdn.com/image/fetch/$s_!hyyf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44320257-0eb9-4384-9570-e2786bbb1509_1782x1436.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Each of these devices connects to the network, so each one gets an IP address. We&#8217;d need to size their subnet accordingly, making sure it has enough available IP addresses to give to each IoT device and allow room for future expansion. Other subnets, by contrast, may only require a few IP addresses. Sizing subnets is important because it can be difficult to change afterwards: in AWS, for example, you can&#8217;t change a subnet&#8217;s IPv4 range once the subnet exists, so resizing means creating a new subnet and moving things over.</p><h3>Public vs. Private Subnets</h3><p>The other use for subnets I mentioned was private vs. public. By default, our subnets can&#8217;t access the Internet&#8212;we&#8217;ll get to that next.</p><p>Going from the outside in, though, we sometimes need to reach a device, or IP, from outside our network. One of the ways to do this is by connecting over the public Internet. Earlier I said networks are private though? 
Right, so to do this we need a <em>public</em> IP address. Unlike the private IP addresses (e.g. 192.168.1.30), a public IP is one that anyone in the world on the public Internet can see and reach.</p><p>In the cloud, public subnets are the only ones where a public IP is actually reachable from the Internet, though this doesn&#8217;t mean every IP in a public subnet is a public IP. That said, for security we should try to only use public subnets for things that must have a public IP address, such as a public-facing application load balancer, SSH and FTP servers, etc.</p><h2>Gateways: Internet Access</h2><p>So how does stuff in the network access the Internet? That&#8217;s pretty important for most networks, whether we&#8217;re just loading web pages at home, or having an application download its dependencies.</p><p>This is where gateways come in, and again you&#8217;re actually using one right now to read this article!</p><p>There are two types of gateways to know about:</p><ol><li><p>NAT Gateway: lets private subnets access the Internet, but not the other way around.</p></li><li><p>Internet Gateway (IGW): lets public subnets have both inbound &amp; outbound Internet access.</p></li></ol><h3>NAT Gateway</h3><p>NAT stands for Network Address Translation. That&#8217;s just a fancy way of saying that it translates private IP addresses (e.g. 192.168.1.30) to a single public IP address on the Internet.</p><p>For example, pick two different computers/phones at home right now and on each of them go to <a href="https://www.whatsmyip.org/">https://www.whatsmyip.org/</a>. You&#8217;ll see the same public IP address reported, even though you used two different devices. That&#8217;s because most consumer-grade home routers also serve as the NAT Gateway in home networks. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iaSU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd84bc77-55b7-4b2d-8210-34b0cda54a9b_1876x1634.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iaSU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd84bc77-55b7-4b2d-8210-34b0cda54a9b_1876x1634.png 424w, https://substackcdn.com/image/fetch/$s_!iaSU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd84bc77-55b7-4b2d-8210-34b0cda54a9b_1876x1634.png 848w, https://substackcdn.com/image/fetch/$s_!iaSU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd84bc77-55b7-4b2d-8210-34b0cda54a9b_1876x1634.png 1272w, https://substackcdn.com/image/fetch/$s_!iaSU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd84bc77-55b7-4b2d-8210-34b0cda54a9b_1876x1634.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iaSU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd84bc77-55b7-4b2d-8210-34b0cda54a9b_1876x1634.png" width="1456" height="1268" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd84bc77-55b7-4b2d-8210-34b0cda54a9b_1876x1634.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1268,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:298675,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.makewithdata.tech/i/167553051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd84bc77-55b7-4b2d-8210-34b0cda54a9b_1876x1634.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iaSU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd84bc77-55b7-4b2d-8210-34b0cda54a9b_1876x1634.png 424w, https://substackcdn.com/image/fetch/$s_!iaSU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd84bc77-55b7-4b2d-8210-34b0cda54a9b_1876x1634.png 848w, https://substackcdn.com/image/fetch/$s_!iaSU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd84bc77-55b7-4b2d-8210-34b0cda54a9b_1876x1634.png 1272w, https://substackcdn.com/image/fetch/$s_!iaSU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd84bc77-55b7-4b2d-8210-34b0cda54a9b_1876x1634.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>In more technical terms, that public IP address you saw was the current public IP address assigned by your Internet Service Provider (ISP) to your router&#8217;s Wide Area Network (WAN) connection. <strong>Note:</strong> your ISP may change your public IP at any point, and this is quite common for dynamic IPs.</p><p>So think of NAT as a funnel for your private network to access the Internet, and remember this behavior of the single public IP address. It will come in handy when we talk about firewalls.</p><h3>Internet Gateway (IGW)</h3><p>The other type is the Internet Gateway. Internet Gateways are for public subnets, and they allow inbound traffic from the Internet as well as outbound. 
</p><p>Put simply, if your instance/machine has a public IP, you need an Internet Gateway for it to communicate with the Internet.</p><h2>Security Groups</h2><p>The last piece we&#8217;ll learn today is Security Groups. These are virtual firewalls attached to your instances and other resources. Each one has ingress and egress rules that define which IP ranges and ports are allowed to receive or send traffic, respectively.</p><p>On a home network, everything can usually reach everything else by default. To lock things down in the cloud, we use Security Groups and set ingress and egress rules. For example, you might only allow things in an &#8220;Application&#8221; private subnet to talk to a &#8220;Databases&#8221; subnet on port 3306 for MySQL, or open port 443 for a public web server.</p><p>An easy way to test if you can reach a certain port at a target IP is with <code>nc -vz &lt;ip&gt; &lt;port&gt;</code>, for example: <code>nc -vz 10.0.0.92 3306</code></p><h2>Let&#8217;s wrap up</h2><p>These are the same essentials you&#8217;ll use in cloud providers like AWS! No voodoo magic going on here, just good ol&#8217; networking but shifted into on-demand scalable cloud computing! Now go tell your friends, family, or neighbors how simple it all is! </p><h3>Want more?</h3><p>If this has helped you understand cloud networking better, please consider following/subscribing to my newsletter! </p><p>You may be wondering how to secure traffic <em>across</em> private networks <em>without</em> it going over the public Internet. On that note, if I see enough interest through comments/likes, I&#8217;ll create a Part 2 to go into intermediate concepts like VPC peering, Transit Gateway, and PrivateLink!</p>]]></content:encoded></item><item><title><![CDATA[Frontend is dead. 
Long live CLI.]]></title><description><![CDATA[The best interface to offer your users might just surprise you.]]></description><link>https://www.makewithdata.tech/p/frontend-is-dead-long-live-cli</link><guid isPermaLink="false">https://www.makewithdata.tech/p/frontend-is-dead-long-live-cli</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Mon, 02 Jun 2025 13:02:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QCgl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb40226-66dd-4647-bf51-0264a4259179_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QCgl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb40226-66dd-4647-bf51-0264a4259179_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QCgl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb40226-66dd-4647-bf51-0264a4259179_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!QCgl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb40226-66dd-4647-bf51-0264a4259179_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!QCgl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb40226-66dd-4647-bf51-0264a4259179_1024x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!QCgl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb40226-66dd-4647-bf51-0264a4259179_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QCgl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb40226-66dd-4647-bf51-0264a4259179_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4eb40226-66dd-4647-bf51-0264a4259179_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1508459,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.makewithdata.tech/i/164376402?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb40226-66dd-4647-bf51-0264a4259179_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QCgl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb40226-66dd-4647-bf51-0264a4259179_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!QCgl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb40226-66dd-4647-bf51-0264a4259179_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!QCgl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb40226-66dd-4647-bf51-0264a4259179_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!QCgl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb40226-66dd-4647-bf51-0264a4259179_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>Simpler is Better</h2><p>Ok, the title may be a bit reductive, but seriously, simpler is (usually) better. 
Let&#8217;s talk about the underdog of interfaces: the Command Line Interface (CLI).</p><p>Let me guess, you&#8217;ve got a small team (or are part of one), with a short deadline and a brilliant product idea. What do you do?</p><p>Spin up a backend and design a REST API. Oh, users need something to see and control, so throw in a shiny React frontend. REST API getting complicated? Split it into a backend API and a Backend For Frontend (BFF) API. Great, two months go by and the product is still not ready. In fact, I&#8217;d probably spend 3 weeks just debugging CORS issues and arguing over UI layout in Figma.</p><p>Or&#8230;do you start with a CLI that solves the problem in a few days?</p><p>I know this is brazen, and one size never fits all. I&#8217;ll get to that, too. The point is that we&#8217;re so sold on the idea that &#8220;real software products&#8221; need a polished web UI and/or desktop application that we forget the humble CLI. </p><p>In fact, the CLI is making a comeback (or maybe it was never quite lost, just forgotten), <strong>especially</strong> with the rise of AI agents and standards like Model Context Protocol (MCP).</p><p></p><h2>Interface Trade-off: Frontend vs CLI</h2><p>I love pro/con lists, so let&#8217;s do that.</p><h3>Frontend (UI)</h3><ul><li><p>&#9989; Looks great (hopefully)</p></li><li><p>&#9989; Great for all users, even non-technical</p></li><li><p>&#9989; Web applications keep users on the latest version easily</p></li><li><p>&#10060; Good UI/UX is hard and time-consuming. Most of us suck at it.</p></li><li><p>&#10060; Highest development and QA time of all. Slowest velocity.</p></li><li><p>&#10060; Yet another layer that can break/have bugs</p></li><li><p>&#10060; Compatibility can be a nightmare</p></li><li><p>&#10060; Hard for AI to integrate well with</p></li></ul><p>For most teams, even highly skilled ones, UIs are where ideas go to die a slow, expensive death.</p><h3>CLI</h3><ul><li><p>&#9989; Straightforward architecture. 
Some don&#8217;t even need an API to talk to.</p></li><li><p>&#9989; Easy to test</p></li><li><p>&#9989; Scriptable, automatable, easy to integrate with AI</p></li><li><p>&#9989; Highest velocity to develop and maintain</p></li><li><p>&#10060; Users must know their way around a terminal</p></li><li><p>&#10060; Limited visual capabilities. TUI libraries exist but are niche. Note: some of us consider this a &#9989; too.</p></li></ul><p>To be clear, data and security are the common thread between these. You&#8217;re (almost) always going to start with an API to secure client/server access to some data, unless your CLI is static or client-side-only. In more mature organizations, the API usually comes first, and the CLI and/or SDKs are code-generated from it to automate development and keep them consistent with the API.</p><p></p><h2>Why Velocity Matters</h2><p>Shipping fast matters. Not just in startups, and not just at the onset of a new project.</p><p>Every extra interface layer is a recurring tax on your design, development, QA, and release times.</p><p>I&#8217;ve seen really innovative product ideas fail because they took too long to build, or because the second a significant feature was needed, it took too long to work cross-functionally to make it a reality.</p><p>Starting with a CLI when practical can reduce your time-to-value from months to mere days, or hell, even hours. You can always add a UI later if the demand justifies it.</p><p></p><h2>Agentic AI: Zero-Code Integration with CLIs</h2><p>Let&#8217;s talk about how AI is reshaping interfaces. </p><p>The world is drooling over Function Calling, Model Context Protocol (MCP), A2A, and probably more protocols by next week. They all rely on invoking structured, deterministic behavior.</p><p>That&#8217;s CLI 101.</p><p>AI agents need predictable inputs, consistent outputs, and knowledge of available tasks. 
This is trivial with CLIs, especially since many CLI frameworks, like <a href="https://github.com/clap-rs/clap">clap</a> and <a href="https://github.com/spf13/cobra">cobra</a>, offer built-in shell auto-completion and robust help output.</p><p>Check out this ~1 minute demo by Block, in which they give their AI agent (Goose) the power to integrate with Databricks simply by handing it the Databricks CLI &#129327;</p><div id="youtube2--IxEl97Wv1E" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;-IxEl97Wv1E&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/-IxEl97Wv1E?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>That&#8217;s crazy; they didn&#8217;t write a custom MCP server or have to use some janky &#8220;accessibility&#8221; AI to control the browser, keyboard, and mouse. They just used the existing CLI. Simple.</p><p></p><h2>Final Thoughts: Less Chrome, More Commands</h2><p>If you&#8217;re launching something soon, ask yourself:</p><ul><li><p>Could this be a CLI first?</p></li><li><p>Will AI tools need to interact with this later?</p></li></ul><p>Or maybe you already have an API you&#8217;re proud of. In that case, consider adding a CLI to make it easier to script or integrate with AI tools.</p><p>Not sure where to start? If you have a client/server pattern, start with the API, full stop. </p><p>When you&#8217;re ready to code a CLI, I highly recommend <a href="https://golang.org/">Go</a>. 
Rust is also really popular for building CLIs, but I&#8217;m currently re-learning Rust for the 4th time, so the learning curve is steep.</p>]]></content:encoded></item><item><title><![CDATA[Are You Leaking Secrets with Terraform?]]></title><description><![CDATA[Make your Terraform code bulletproof when it comes to sensitive values.]]></description><link>https://www.makewithdata.tech/p/are-you-leaking-secrets-with-terraform</link><guid isPermaLink="false">https://www.makewithdata.tech/p/are-you-leaking-secrets-with-terraform</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Tue, 27 May 2025 13:02:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!r6xC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda8a6122-0e37-4e39-a607-b7cff1612d4b_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r6xC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda8a6122-0e37-4e39-a607-b7cff1612d4b_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r6xC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda8a6122-0e37-4e39-a607-b7cff1612d4b_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!r6xC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda8a6122-0e37-4e39-a607-b7cff1612d4b_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!r6xC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda8a6122-0e37-4e39-a607-b7cff1612d4b_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!r6xC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda8a6122-0e37-4e39-a607-b7cff1612d4b_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r6xC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda8a6122-0e37-4e39-a607-b7cff1612d4b_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da8a6122-0e37-4e39-a607-b7cff1612d4b_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2009868,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.makewithdata.tech/i/163520414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda8a6122-0e37-4e39-a607-b7cff1612d4b_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r6xC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda8a6122-0e37-4e39-a607-b7cff1612d4b_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!r6xC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda8a6122-0e37-4e39-a607-b7cff1612d4b_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!r6xC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda8a6122-0e37-4e39-a607-b7cff1612d4b_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!r6xC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda8a6122-0e37-4e39-a607-b7cff1612d4b_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>A 2022 <a href="https://blog.gitguardian.com/the-state-of-secrets-sprawl-2022/">GitGuardian report</a> revealed that, on average, <strong>3 out of every 1,000 commits</strong> <strong>contained a credential</strong>, and the number of leaked secrets has been rising sharply each year.</p><p>C&#8217;mon, folks, let&#8217;s try harder to keep our companies out of the news.</p><p>No matter your role, if you contribute code of any variety, you should be aware of best practices when it comes to using secrets in that code. Today though, let&#8217;s focus on Terraform and look at real-world examples.</p><h2>(Insecure) Hardcoding sensitive values</h2><p>Take the following Terraform example. It&#8217;s a classic use case that deploys a Lambda function with an environment variable containing an API key for a third-party API or database.</p><pre><code>resource "aws_lambda_function" "test_lambda" {
  filename         = "lambda_function_payload.zip"
  function_name    = "lambda_function_name"
  role             = aws_iam_role.iam_for_lambda.arn
  handler          = "index.test"
  source_code_hash = data.archive_file.lambda.output_base64sha256
  runtime          = "nodejs18.x"

  environment {
    variables = {
      # &#9888;&#65039; DO NOT DO THE FOLLOWING &#9888;&#65039;
      API_KEY = "390de376-a86d-4d5d-9ff7-6ca977a3f2aa"
    }
  }
}</code></pre><p>Great, the stage is set. It&#8217;s obvious why you shouldn&#8217;t do this: the secret sits in plaintext in your code and ends up in Git history. By the way, if you think it&#8217;s easy to remove secrets from Git once they&#8217;re committed, <a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository">think again</a>.</p><p></p><h2>(Insecure) Marking variables as sensitive</h2><p>Fortunately, Terraform has this thing called <a href="https://developer.hashicorp.com/terraform/tutorials/configuration-language/sensitive-variables">sensitive variables</a> we can use. </p><pre><code>variable "api_key" {
  description = "Key credential for the API"
  type        = string
  sensitive   = true
}</code></pre><p>Great, problem solved, right!? Nope.</p><p>Marking variables as sensitive in Terraform is good practice, but it only provides basic protection against <em>some</em> accidental exposure in the CLI and log output. It does NOT encrypt the secret: the value is still written to the state file in plaintext, so you&#8217;re still at risk if this is all you do.</p><p>Let&#8217;s look at options that are considered truly &#8220;secure.&#8221;</p><p></p><h2>HCP Terraform &amp; HashiCorp Vault</h2><p>If you use HCP Terraform or Terraform Enterprise, you can rest easy knowing that the platform <a href="https://developer.hashicorp.com/terraform/cloud-docs/workspaces/variables/managing-variables#sensitive-values">encrypts all variable values</a>. </p><p>Likewise, <a href="https://developer.hashicorp.com/terraform/tutorials/secrets/secrets-vault">HashiCorp Vault</a> is a great solution for securing your secrets in Terraform.</p><p>However, I&#8217;d rather not pay a premium for something we can do pretty trivially on our own. Let&#8217;s look at DIY options.</p><p></p><h2>AWS Solutions</h2><h3>Integrating Secrets Manager</h3><p>AWS Secrets Manager is a managed service for&#8230;yep, managing secrets. Secrets Manager is one of the go-to tools for securing sensitive values for applications hosted on AWS, and we can tap into this power for Terraform too.</p><p>The process is fairly simple:</p><ol><li><p>Create the secret in AWS Secrets Manager. Use whatever method you prefer (AWS Console, CLI, etc.), just not Terraform. 
Take note of the secret&#8217;s ID (its ARN or name).</p></li><li><p>Use Terraform&#8217;s <a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/secretsmanager_secret_version">aws_secretsmanager_secret_version</a> data source to query the secret by its ID (you can use a regular TF variable to pass the ID if you wish).</p></li><li><p>If the secret is stored as JSON, use Terraform&#8217;s <code>jsondecode()</code> function on the data source&#8217;s <code>.secret_string</code> attribute to parse it.</p></li></ol><pre><code><code>data "aws_secretsmanager_secret_version" "api_key" {
  secret_id = "arn:aws:secretsmanager:us-east-1:123456789012:secret:my_api_key-123456"
}

locals {
  api_key = jsondecode(data.aws_secretsmanager_secret_version.api_key.secret_string)
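  # Sketch, assuming the secret is stored as a JSON object such as
  # {"api_key": "..."}; api_key_value is an illustrative name, not from the
  # original post. Indexing the decoded object yields the raw string. For a
  # plain-string secret, skip jsondecode() and read secret_string directly.
  api_key_value = jsondecode(data.aws_secretsmanager_secret_version.api_key.secret_string)["api_key"]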
}</code></code></pre><p>Pros:</p><ul><li><p>Centralizes where secrets are managed.</p></li><li><p>Access to secrets can be limited to read-only.</p></li><li><p>Access can be revoked even if a user has access to the Terraform code.</p></li></ul><p>Cons:</p><ul><li><p>The plaintext value can still end up in the Terraform state file, for example if the decrypted value is passed to a resource attribute that is stored unprotected.</p></li><li><p>Requires a non-Terraform way of creating and editing secrets initially.</p></li></ul><p></p><h3>KMS-Encrypted Values</h3><p>AWS Key Management Service (KMS) is another acceptable solution similar to Secrets Manager. With KMS, the process is nearly identical:</p><ol><li><p>Use the <a href="https://docs.aws.amazon.com/cli/latest/reference/kms/encrypt.html">AWS CLI for KMS</a> to encrypt the plaintext secret value and get its ciphertext.</p></li><li><p>Create a local value or variable in Terraform to store the KMS ciphertext.</p></li><li><p>Use Terraform&#8217;s <a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/kms_secrets">aws_kms_secrets</a> data source to dynamically decrypt the ciphertext to its usable value.</p></li></ol><pre><code><code>locals {
  api_key_ciphertext = "AQECAHgaPa0J8WaeplGCqqVAr4HNvDaFSQ+NaiwIBhmm6qDSFwAAAGIwYAYJKoZIhvcNAQcGoFMwUQIBADBMBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDI+LoLdvYv8l41OhAAIBEIAfx49FFJCLeYrkfMfAw6XlnxP13MmDBdqP8dPp28OoBQ=="
}

data "aws_kms_secrets" "this" {
  secret {
    name    = "api_key"
    payload = local.api_key_ciphertext
  }

  # &lt;more secret{} blocks here&gt;
}</code></code></pre><pre><code>locals {
  api_key = data.aws_kms_secrets.this.plaintext["api_key"]
}</code></pre><p>Pros:</p><ul><li><p>Access to secrets can be limited to read-only via the KMS keys used to encrypt them.</p></li><li><p>Access can be revoked even if a user has access to the Terraform code.</p></li><li><p>Sensitive values can be safely committed to a VCS like Git, because they are ciphertext. This can make development more transparent and provide better history for auditing.</p></li><li><p>More control over how the encryption key is managed (such as key rotation and multi-region keys) when using Customer-Managed Keys (CMK).</p></li></ul><p>Cons:</p><ul><li><p>The plaintext value can still end up in the Terraform state file, for example if the decrypted value is passed to a resource attribute that is stored unprotected. </p><ul><li><p>Note: if you are using Terraform v1.10 or later, you can use <a href="https://developer.hashicorp.com/terraform/language/state/sensitive-data#ephemeral-data">ephemeral values</a> to prevent it from being stored in state at all.</p></li></ul></li><li><p>Requires a non-Terraform way of creating and editing secrets initially.</p></li></ul><p></p><h2>Cloud-Agnostic Solutions</h2><p>Not using AWS? No problem, you can still do this.</p><h3>Integrating SOPS</h3><p><a href="https://getsops.io/">SOPS</a> is an open-source, free, CNCF tool for encrypting and decrypting secrets. 
It&#8217;s cloud-agnostic, so you can choose your preferred backend: AWS KMS, Azure Key Vault, GCP KMS, as well as age and GPG for generic environments.</p><p>I found this <a href="https://dev.to/hkhelil/secure-secret-management-with-sops-in-terraform-terragrunt-231a">blog post on dev.to</a> very useful if you&#8217;d like to set up SOPS with Terraform/Terragrunt.</p><p>This essentially involves the following process:</p><ol><li><p>Store your secrets in a file like <strong>secrets.yaml</strong>.</p></li><li><p>Encrypt the file using the SOPS CLI, producing an encrypted file such as <strong>secrets.enc.yaml</strong> (you can also use a Terraform <code>null_resource</code> with a <code>local-exec</code> provisioner).</p></li><li><p>Use the SOPS Terraform provider&#8217;s <a href="https://registry.terraform.io/providers/carlpett/sops/latest/docs/data-sources/file">sops_file</a> data source to decrypt the file and use its values throughout your Terraform configuration.</p></li></ol><pre><code># secrets.yaml
api_key: "supersecret"</code></pre><pre><code>data "sops_file" "secrets" {
  source_file = "${path.module}/secrets.enc.yaml"
}

locals {
  api_key = data.sops_file.secrets.data["api_key"]
}</code></pre><p>Pros:</p><ul><li><p>Extremely flexible. Encryption methods can be mixed and matched across various providers.</p></li><li><p>Supports both cloud-managed and local services.</p></li><li><p>Sensitive values can be safely committed to a VCS like Git, because they are ciphertext. This can make development more transparent and provide better history for auditing.</p></li></ul><p>Cons:</p><ul><li><p>The plaintext value can still end up in the Terraform state file, for example if the decrypted value is passed to a resource attribute that is stored unprotected. </p><ul><li><p>Note: if you are using Terraform v1.10 or later, use <a href="https://developer.hashicorp.com/terraform/language/state/sensitive-data#ephemeral-data">ephemeral values</a> to prevent it from being stored in state at all.</p></li></ul></li><li><p>Key rotation is supported but may be more difficult to implement when using local GPG or <strong>age</strong> keys.</p></li></ul><p></p><h2>Detect Leaked Secrets</h2><p>Time for your homework assignment. That&#8217;s right, action!</p><p>You can take a proactive approach to keeping your code (infrastructure and otherwise) secure by running a free, open-source tool like <a href="https://github.com/trufflesecurity/trufflehog">trufflehog</a>!</p><p>I&#8217;m not sponsored by anyone, but seriously&#8230;in just a few minutes, you can install trufflehog and run it against your local Git repo.</p><p>Tools like these can even be executed as pre-commit and pre-receive hooks, as well as in CI/CD, so you don&#8217;t accidentally leak secrets in the first place. </p><p></p><h2>What We Learned</h2><p>Managing secrets is an everyday task for DevOps and platform engineers. These secrets can be a major attack vector if we&#8217;re not careful.</p><p>We&#8217;ve seen the examples of what NOT to do. 
With multiple solutions for AWS and other environments, we know what secure Terraform secrets should look like.</p><p>I expect figures like those reported by GitGuardian at the beginning will continue to increase each year; however, hopefully these best practices will help you reading this to break that trend.</p>]]></content:encoded></item><item><title><![CDATA[Building a Real-Time Satellite Tracking Pipeline with Spark Structured Streaming]]></title><description><![CDATA[Leverage the latest features of Apache Spark 4 to stream, process, and visualize satellite orbit telemetry using public TLE data and real-time processing techniques.]]></description><link>https://www.makewithdata.tech/p/building-a-real-time-satellite-tracking</link><guid isPermaLink="false">https://www.makewithdata.tech/p/building-a-real-time-satellite-tracking</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Tue, 15 Apr 2025 13:03:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VgV7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7a6154a-1d57-400a-bbe6-0692cac757f9_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VgV7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7a6154a-1d57-400a-bbe6-0692cac757f9_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VgV7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7a6154a-1d57-400a-bbe6-0692cac757f9_1024x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!VgV7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7a6154a-1d57-400a-bbe6-0692cac757f9_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!VgV7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7a6154a-1d57-400a-bbe6-0692cac757f9_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!VgV7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7a6154a-1d57-400a-bbe6-0692cac757f9_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VgV7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7a6154a-1d57-400a-bbe6-0692cac757f9_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f7a6154a-1d57-400a-bbe6-0692cac757f9_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1485342,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.makewithdata.tech/i/161239932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7a6154a-1d57-400a-bbe6-0692cac757f9_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!VgV7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7a6154a-1d57-400a-bbe6-0692cac757f9_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!VgV7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7a6154a-1d57-400a-bbe6-0692cac757f9_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!VgV7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7a6154a-1d57-400a-bbe6-0692cac757f9_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!VgV7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7a6154a-1d57-400a-bbe6-0692cac757f9_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h1>&#10024; Introduction</h1><p>If you&#8217;re into space, objects in orbit, and data&#8230; I have a treat for you today! I have really enjoyed getting more familiar with PySpark 4&#8217;s <a href="https://docs.databricks.com/aws/en/pyspark/datasources">Custom Data Sources</a>, and recently wanted to explore more datasets for real-time streaming use cases.</p><p>One day, I stumbled across this&nbsp;<a href="https://tle.ivanstanojevic.me/">TLE API</a>, a public API providing data from&nbsp;<a href="https://celestrak.com/">CelesTrak</a>, a non-profit that hosts data for the space community. </p><blockquote><p>Satellite data is super compelling for real-time use cases because satellites are constantly moving. Tens of thousands of these objects (over 63k known to the public, in fact) are in orbit for a variety of purposes: communication satellites like Starlink, weather satellites, navigation/GPS satellites, reconnaissance satellites, and even natural satellites like asteroids or the moon.</p></blockquote><p>In this post, we&#8217;ll build a real-time streaming pipeline using Apache Spark to track satellites using Two-Line Element (TLE) data, predict their near-future positions, and visualize their motion in 3D.</p><h1>&#128268; Data Sources</h1><p>First, we need to identify our data sources. How does one get this data, much less determine a satellite's position or velocity? 
If you work in space telemetry or are already a hobbyist, you probably already know the answer; for the rest of us, let&#8217;s take a moment to understand a piece of data known as <strong>TLE (Two-Line Element)</strong>.</p><h2>&#127963;&#65039; A Brief History of TLE</h2><p><strong>TLEs (Two-Line Element Sets)</strong>: The standard format for describing satellite orbits. </p><p>It&#8217;s called &#8220;two-line&#8221; because it literally consists of <strong>two lines of 69-character data</strong> that encode orbital elements necessary to calculate a satellite&#8217;s position and velocity at a given time.</p><p>Format example:</p><pre><code><code>ISS (ZARYA)
1 25544U 98067A   24101.21315433  .00002243  00000+0  48609-4 0  9991
2 25544  51.6392  59.7121 0003000  73.4153 286.7121 15.50011348397460</code></code></pre><p>The US Air Force developed this format, and it was adopted by NORAD (North American Aerospace Defense Command) during the Cold War to track objects in space. It was designed to be machine-readable, which at the time meant early mainframe systems and punch cards!</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qa8R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71d0560e-9896-4f11-9c74-c95a484c6be5_558x220.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qa8R!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71d0560e-9896-4f11-9c74-c95a484c6be5_558x220.gif 424w, https://substackcdn.com/image/fetch/$s_!qa8R!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71d0560e-9896-4f11-9c74-c95a484c6be5_558x220.gif 848w, https://substackcdn.com/image/fetch/$s_!qa8R!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71d0560e-9896-4f11-9c74-c95a484c6be5_558x220.gif 1272w, https://substackcdn.com/image/fetch/$s_!qa8R!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71d0560e-9896-4f11-9c74-c95a484c6be5_558x220.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qa8R!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71d0560e-9896-4f11-9c74-c95a484c6be5_558x220.gif" width="728" height="287.02508960573476" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71d0560e-9896-4f11-9c74-c95a484c6be5_558x220.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:558,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qa8R!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71d0560e-9896-4f11-9c74-c95a484c6be5_558x220.gif 424w, https://substackcdn.com/image/fetch/$s_!qa8R!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71d0560e-9896-4f11-9c74-c95a484c6be5_558x220.gif 848w, https://substackcdn.com/image/fetch/$s_!qa8R!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71d0560e-9896-4f11-9c74-c95a484c6be5_558x220.gif 1272w, https://substackcdn.com/image/fetch/$s_!qa8R!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71d0560e-9896-4f11-9c74-c95a484c6be5_558x220.gif 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Credit: https://spaceflight.nasa.gov/realdata/sightings/SSapplications/Post/JavaSSOP/SSOP_Help/tle_def.html</figcaption></figure></div><p>For details on each field encoded into this format, check out the <a href="https://en.wikipedia.org/wiki/Two-line_element_set#Format">Wikipedia for TLE</a>.</p><h2>Public Sources</h2><p>TLE is publicly available from sources like:</p><ul><li><p><a 
href="https://celestrak.org/">CelesTrak</a> (or APIs like <a href="https://tle.ivanstanojevic.me/#/">TLE API</a>, which is built on CelesTrak data)</p></li><li><p><a href="https://www.space-track.org/">Space-Track.org</a></p></li></ul><p>TLEs are updated frequently, so a client can poll these sources periodically and treat the results as a stream.</p><h2>Simplified Perturbations Models (SGP/SDP)</h2><p>Unfortunately, a TLE cannot tell us a satellite&#8217;s position or velocity by itself. For this, we need an established propagation algorithm like SGP4 or SDP4, which belong to the family of <a href="https://en.wikipedia.org/wiki/Simplified_perturbations_models">Simplified Perturbations Models</a>.</p><p>Don&#8217;t worry: you don&#8217;t have to be a rocket scientist to understand this. Think of it as a function of two inputs, the TLE and a timestamp. In other words, we can ask, &#8220;Based on the TLE, where will this object be at this point in time?&#8221;</p><h1>&#127959;&#65039; Architecture Overview</h1><ul><li><p><strong>Ingestion: </strong>TLE data is streamed in by a custom Spark data source that we build around the HTTP TLE API; it ingests TLEs for one or more satellites.</p></li><li><p><strong>Processing: </strong>Transform the satellites&#8217; state into a usable format (coordinates we can plot relative to Earth).</p></li><li><p><strong>Prediction:</strong>&nbsp;Use SGP/SDP models combined with the TLE to predict the satellites&#8217; position and velocity at various points in time.</p></li><li><p><strong>Visualization: </strong>3D plots in matplotlib.</p></li></ul><h1>&#128225; Ingesting Satellite Data</h1><p>We&#8217;ll cover a lot of code here, but I&#8217;ll break it down step by step. Note that this requires PySpark 4, which introduces the Python Data Source API (I used <code>pyspark==4.0.0.dev1</code> at the time of this writing).</p><h2>Project Setup</h2><p>First, we will bootstrap the project with some dependencies. 
I am using <a href="https://docs.astral.sh/uv/getting-started/">uv</a> to manage the Python project.</p><pre><code>uv init --python python3.11 spark-ingest-tle
cd spark-ingest-tle
uv add \
  grpcio \
  grpcio-status \
  pyspark==4.0.0.dev1 \
  astropy \
  sgp4 \
  matplotlib \
  pandas \
  requests</code></pre><p></p><p>Now for the main code, let&#8217;s handle the imports and set up the Spark session:</p><pre><code>import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pyspark.sql.datasource import DataSource, DataSourceStreamReader, DataSourceReader, InputPartition
from pyspark.sql.types import StructType
from pyspark.sql import SparkSession, functions as F

import requests
from datetime import datetime, timedelta

spark = SparkSession.builder.appName("SatelliteStream").getOrCreate()</code></pre><h2>Implementing Streaming Reader</h2><p>To support streaming with our custom source, we must implement PySpark&#8217;s <code>DataSourceStreamReader</code> class. It defines the interface for how partitions are generated, how data is read, and how offsets (and therefore checkpoints) are defined.</p><pre><code>def fetch_tle_data(norad_id: str, api_key: str) -&gt; dict:
    headers = {"Accept": "*/*", "User-Agent": "curl"}
    url = f"https://tle.ivanstanojevic.me/api/tle/{norad_id}"
    # The timeout keeps a stalled HTTP call from blocking the stream indefinitely
    response = requests.get(url, headers=headers, params={"api_key": api_key}, timeout=10)
    response.raise_for_status()
    return response.json()

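# Optional sketch (not part of the original source): a TLE also carries its own
# epoch in columns 19-32 of line 1, as a two-digit year plus a fractional day
# of year. The hypothetical helper below decodes it, which can be handy for
# cross-checking the "date" field returned by the API.
from datetime import datetime, timedelta, timezone

def parse_tle_epoch(tle_line1):
    yy = int(tle_line1[18:20])
    day_of_year = float(tle_line1[20:32])
    # Two-digit years 57-99 mean 1957-1999; 00-56 mean 2000-2056
    year = 2000 + yy if yy in range(57) else 1900 + yy
    # Day 1.0 is midnight UTC on January 1st, hence the "- 1"
    return datetime(year, 1, 1, tzinfo=timezone.utc) + timedelta(days=day_of_year - 1)
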
class SatellitePartition(InputPartition):
    def __init__(self, norad_id: str, name: str, tle_line1: str, tle_line2: str, tle_timestamp: str, size: int):
        self.norad_id = norad_id
        self.name = name
        self.tle_line1 = tle_line1
        self.tle_line2 = tle_line2
        self.tle_timestamp = tle_timestamp
        self.size = size

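# Optional sketch (not part of the original source): the "timedelta" option is a
# "unit=value" string such as "minutes=1". A hypothetical helper like this one
# could validate it up front and fail with a readable message, instead of an
# IndexError in the constructor or a TypeError later inside read().
from datetime import timedelta

ALLOWED_TIMEDELTA_UNITS = ("days", "hours", "minutes", "seconds")

def parse_timedelta_option(value="seconds=1"):
    unit, sep, amount = value.partition("=")
    unit = unit.strip()
    if sep != "=" or unit not in ALLOWED_TIMEDELTA_UNITS:
        raise ValueError(f"Expected 'unit=value' with unit in {ALLOWED_TIMEDELTA_UNITS}, got {value!r}")
    return timedelta(**{unit: int(amount)})
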
class SatelliteStreamReader(DataSourceStreamReader):
    def __init__(self, schema: StructType, options: dict):
        self.norad_id = options.get("norad_id").split(",")
        self.size = int(options.get("size", 5))
        self.timedelta = options.get("timedelta", "seconds=1").split("=")
        self.timedelta = {self.timedelta[0]: int(self.timedelta[1])}
        self.api_key = options.get("api_key", "DEMO_KEY")

    def initialOffset(self):
        """
        Return the initial offset of the streaming data source.
        A new streaming query starts reading data from the initial offset.
        If Spark is restarting an existing query, it will restart from the check-pointed offset
        rather than the initial one.

        Returns
        -------
        dict
            A dict or recursive dict whose key and value are primitive types, which includes
            Integer, String and Boolean.
        """
        return {}
    
    def latestOffset(self) -&gt; dict:
        """
        Returns the most recent offset available.

        Returns
        -------
        dict
            A dict or recursive dict whose key and value are primitive types, which includes
            Integer, String and Boolean.
        """
        # Fetch the TLE data for each of the given satellites
        offsets = {}
        for norad_id in self.norad_id:
            tle = fetch_tle_data(norad_id, self.api_key)
            tle_line1 = tle["line1"]
            tle_line2 = tle["line2"]
            name = tle["name"]
            tle_timestamp = tle["date"]
            offsets[norad_id] = {"name": name, "tle_line1": tle_line1, "tle_line2": tle_line2, "tle_timestamp": tle_timestamp}
        return offsets

    def partitions(self, start: dict, end: dict):
        partitions = []
        for norad_id, val in end.items():
            prev = start.get(norad_id)
            # Emit a partition for a newly seen satellite or one whose TLE changed
            changed = prev is None or any(
                prev[key] != val[key] for key in ("tle_line1", "tle_line2", "tle_timestamp")
            )
            if changed:
                partitions.append(SatellitePartition(norad_id, val["name"], val["tle_line1"], val["tle_line2"], val["tle_timestamp"], self.size))
        return partitions

    def read(self, partition: SatellitePartition):
        from datetime import datetime, timedelta, timezone
        from sgp4.api import Satrec, jday, SGP4_ERRORS

        satellite = Satrec.twoline2rv(
            partition.tle_line1,
            partition.tle_line2
        )

        for i in range(partition.size):
            # SGP4 expects UTC, so take the current time in UTC rather than local time
            ts = datetime.now(timezone.utc) + (timedelta(**self.timedelta) * i)
            jd, fr = jday(
                ts.year,
                ts.month,
                ts.day,
                ts.hour,
                ts.minute,
                ts.second,
            )

            e, r, v = satellite.sgp4(jd, fr)
            if e != 0:
                e = SGP4_ERRORS.get(e, 'Unknown error')
            r = {"x": r[0], "y": r[1], "z": r[2]}
            v = {"x": v[0], "y": v[1], "z": v[2]}
            yield (
                ts,
                r,
                v,
                e,
                partition.norad_id,
                partition.name,
                datetime.strptime(partition.tle_timestamp, "%Y-%m-%dT%H:%M:%S%z"),
                partition.tle_line1,
                partition.tle_line2,
                jd,
                fr
            )</code></pre><p>Let&#8217;s break it down.</p><p>It starts with the constructor <code>__init__</code> which is where we&#8217;ll grab any custom options we&#8217;d like to define. </p><pre><code>    def __init__(self, schema: StructType, options: dict):
        self.norad_id = options.get("norad_id").split(",")
        self.size = int(options.get("size", 5))
        self.timedelta = options.get("timedelta", "seconds=1").split("=")
        self.timedelta = {self.timedelta[0]: int(self.timedelta[1])}
        self.api_key = options.get("api_key", "DEMO_KEY")</code></pre><p>These are the options we can pass in like so: <code>spark.readStream.option("norad_id", "33499")</code>&#8230;</p><p>Disregard the schema argument here for now. We won&#8217;t be using it, since our data source has a fixed schema based on the TLE API.</p><h3>initialOffset and latestOffset</h3><p><strong>Offsets</strong> are how Spark Structured Streaming keeps track of which data has already been processed in a stream. Think of an offset as a bookmark or, in database terms, a cursor.</p><p>When developing your own custom PySpark source, you get to define what these offsets look like. How you model them depends heavily on the system(s) your source consumes; for a durable stream this means storing position markers, such as a partition and offset for Kafka or a <strong>shardId</strong> and <strong>sequenceNumber</strong> for Kinesis. In our case, the TLE API doesn&#8217;t offer a way to retrieve previous data, so we won&#8217;t worry about replayability; however, the API does return a <strong>date</strong> field, and we can check whether the date, or the contents of line1 or line2, have changed since the last time we fetched the TLE.</p><h4>Our Offset Data Model</h4><p>Since I want to compare these values each time I poll the API, I&#8217;ll store them in my offsets. The offsets will therefore look like this:</p><pre><code>{
  satellite_id: {
    "name": name,
    "tle_line1": tle_line1, 
    "tle_line2": tle_line2, 
    "tle_timestamp": tle_timestamp
  },
  ...
}</code></pre><p>Using the satellite ID (NORAD ID) as a key allows us to track multiple satellites in the same stream, which will be useful and more scalable later on.</p><h4>Initial Offset</h4><p>Our class&#8217;s <code>initialOffset</code> method supplies the starting offset for a new stream. Once the stream has started and successfully processed a batch, Spark writes the offsets to its checkpoint; on restart it resumes from the checkpointed offset and no longer calls <code>initialOffset</code>. In our case, we haven&#8217;t fetched any TLE data yet, so we&#8217;ll initialize the offset to an empty dict.</p><h4>Latest Offset</h4><pre><code>    def latestOffset(self) -&gt; dict:
        # Fetch the TLE data for each of the given satellites
        offsets = {}
        for norad_id in self.norad_id:
            tle = fetch_tle_data(norad_id, self.api_key)
            tle_line1 = tle["line1"]
            tle_line2 = tle["line2"]
            name = tle["name"]
            tle_timestamp = tle["date"]
            offsets[norad_id] = {"name": name, "tle_line1": tle_line1, "tle_line2": tle_line2, "tle_timestamp": tle_timestamp}
        return offsets
</code></pre><p>The latest offset should return the most recent offset available for our source. This is where we actually make the HTTP request to fetch TLE data for each of the satellites being tracked.</p><p>As you may recall from the class constructor, we allow the user to pass in multiple satellite IDs separated by commas.</p><h3>Partitions</h3><p>If offsets act like pointers into the source, then the class&#8217;s <code>partitions</code> method takes Point A and Point B (the start and end offsets) and divides the range between them into one or more logical partitions.</p><p>The <strong>read()</strong> method will be invoked once for each partition returned by our source.</p><p>In my case, I&#8217;d like to emit new data only when there is new or updated TLE information for a satellite, and I&#8217;d like to generate one partition per satellite. Thus, I&#8217;ll model my partition like so:</p><pre><code>class SatellitePartition(InputPartition):
    def __init__(self, norad_id: str, name: str, tle_line1: str, tle_line2: str, tle_timestamp: str, size: int):
        self.norad_id = norad_id
        self.name = name
        self.tle_line1 = tle_line1
        self.tle_line2 = tle_line2
        self.tle_timestamp = tle_timestamp
        self.size = size</code></pre><p>Then, the partitions are generated by looping over the <strong>end</strong> offset (latest offset) and comparing it to the <strong>start</strong> offset (last offset that we processed). If there is new data, we&#8217;ll generate a partition.</p><pre><code>    def partitions(self, start: dict, end: dict):
        partitions = []
        for norad_id, val in end.items():
            prev = start.get(norad_id)
            # Emit a partition for a newly seen satellite or one whose TLE changed
            changed = prev is None or any(
                prev[key] != val[key] for key in ("tle_line1", "tle_line2", "tle_timestamp")
            )
            if changed:
                partitions.append(SatellitePartition(norad_id, val["name"], val["tle_line1"], val["tle_line2"], val["tle_timestamp"], self.size))
        return partitions</code></pre><h3>Read and Predict</h3><p>Next, we&#8217;d like to predict the position and velocity of the satellite using math. We&#8217;ll encapsulate the gory details in our custom source so users don&#8217;t have to look like this Charlie Day meme:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tpEo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93ec9f-dbb9-44ea-8d3d-65263b96bb41_666x500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tpEo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93ec9f-dbb9-44ea-8d3d-65263b96bb41_666x500.png 424w, https://substackcdn.com/image/fetch/$s_!tpEo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93ec9f-dbb9-44ea-8d3d-65263b96bb41_666x500.png 848w, https://substackcdn.com/image/fetch/$s_!tpEo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93ec9f-dbb9-44ea-8d3d-65263b96bb41_666x500.png 1272w, https://substackcdn.com/image/fetch/$s_!tpEo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93ec9f-dbb9-44ea-8d3d-65263b96bb41_666x500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tpEo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93ec9f-dbb9-44ea-8d3d-65263b96bb41_666x500.png" width="666" height="500" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc93ec9f-dbb9-44ea-8d3d-65263b96bb41_666x500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:666,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:520912,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.makewithdata.tech/i/161239932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93ec9f-dbb9-44ea-8d3d-65263b96bb41_666x500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tpEo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93ec9f-dbb9-44ea-8d3d-65263b96bb41_666x500.png 424w, https://substackcdn.com/image/fetch/$s_!tpEo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93ec9f-dbb9-44ea-8d3d-65263b96bb41_666x500.png 848w, https://substackcdn.com/image/fetch/$s_!tpEo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93ec9f-dbb9-44ea-8d3d-65263b96bb41_666x500.png 1272w, https://substackcdn.com/image/fetch/$s_!tpEo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93ec9f-dbb9-44ea-8d3d-65263b96bb41_666x500.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>One of the most important notes about the <strong>read()</strong> method is that it should be <em>stateless</em>. Spark serializes the reader and runs <code>read()</code> on the executors, so you should not rely on mutable class member variables or try to change the class&#8217;s state from within this function; reading values that were set once in the constructor, such as the parsed <code>timedelta</code> option, is fine. This also means that any modules the function needs should be imported inside the function itself.</p><p>Our read function takes the TLE data and calculates the object&#8217;s position (r, in kilometers) and velocity (v, in kilometers per second). To make things more interesting, we calculate these not only for the current time but also for N future intervals, thereby predicting where the satellite will be in the near future.</p><pre><code>
    def read(self, partition: SatellitePartition):
        from datetime import datetime, timedelta, timezone
        from sgp4.api import Satrec, jday, SGP4_ERRORS

        satellite = Satrec.twoline2rv(
            partition.tle_line1,
            partition.tle_line2
        )

        for i in range(partition.size):
            # SGP4 expects UTC time expressed as a Julian date (jd) plus a day fraction (fr),
            # so take the current time in UTC rather than local time
            ts = datetime.now(timezone.utc) + (timedelta(**self.timedelta) * i)
            jd, fr = jday(
                ts.year,
                ts.month,
                ts.day,
                ts.hour,
                ts.minute,
                ts.second,
            )

            e, r, v = satellite.sgp4(jd, fr)
            if e != 0:
                e = SGP4_ERRORS.get(e, 'Unknown error')
            r = {"x": r[0], "y": r[1], "z": r[2]}
            v = {"x": v[0], "y": v[1], "z": v[2]}
            yield (
                ts,
                r,
                v,
                e,
                partition.norad_id,
                partition.name,
                datetime.strptime(partition.tle_timestamp, "%Y-%m-%dT%H:%M:%S%z"),
                partition.tle_line1,
                partition.tle_line2,
                jd,
                fr
            )</code></pre><h2>Finishing the Custom Data Source</h2><p>To put it all together for our custom data source, we now need to implement the <code>DataSource</code> class, which is the top-level wrapper. </p><pre><code>
class SatelliteDataSource(DataSource):
    """
    A data source for satellite data.
    """

    @classmethod
    def name(cls):
        return "satellite"

    def schema(self):
        return """
            ts TIMESTAMP,
            pos STRUCT&lt;
                x:DOUBLE,
                y:DOUBLE,
                z:DOUBLE
            &gt;,
            velocity STRUCT&lt;
                x:DOUBLE,
                y:DOUBLE,
                z:DOUBLE
            &gt;,
            e STRING,
            norad_id STRING,
            name STRING,
            tle_timestamp TIMESTAMP,
            tle_line1 STRING,
            tle_line2 STRING,
            jd DOUBLE,
            fr DOUBLE
        """

    def streamReader(self, schema: StructType):
        return SatelliteStreamReader(schema, self.options)
   
</code></pre><p>Finally, we can register the source so Spark knows about it.</p><pre><code>spark.dataSource.register(SatelliteDataSource)</code></pre><h3>&#129321; Using the Custom Source</h3><p>To use the source, we write some familiar, idiomatic PySpark code:</p><pre><code>

df = (
    spark
        .readStream
        .format("satellite")
        .option("norad_id", "25544,33499")
        .option("size", "5")
        .option("timedelta", "minutes=1")
        .option("api_key", "DEMO_KEY")
        .load()
)

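# Process the stream; awaitTermination() blocks until the query stops and returns None,
# so the result is not assigned to a variable
(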
    df
        .writeStream
        .format("console")
        .outputMode("append")
        .trigger(once=True)
        .start()
        .awaitTermination()
)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F1I-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a93f5f4-9c51-4d14-b2a2-281e445723d4_3312x1108.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F1I-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a93f5f4-9c51-4d14-b2a2-281e445723d4_3312x1108.png 424w, https://substackcdn.com/image/fetch/$s_!F1I-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a93f5f4-9c51-4d14-b2a2-281e445723d4_3312x1108.png 848w, https://substackcdn.com/image/fetch/$s_!F1I-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a93f5f4-9c51-4d14-b2a2-281e445723d4_3312x1108.png 1272w, https://substackcdn.com/image/fetch/$s_!F1I-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a93f5f4-9c51-4d14-b2a2-281e445723d4_3312x1108.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F1I-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a93f5f4-9c51-4d14-b2a2-281e445723d4_3312x1108.png" width="1456" height="487" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a93f5f4-9c51-4d14-b2a2-281e445723d4_3312x1108.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:487,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:680622,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.makewithdata.tech/i/161239932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a93f5f4-9c51-4d14-b2a2-281e445723d4_3312x1108.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!F1I-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a93f5f4-9c51-4d14-b2a2-281e445723d4_3312x1108.png 424w, https://substackcdn.com/image/fetch/$s_!F1I-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a93f5f4-9c51-4d14-b2a2-281e445723d4_3312x1108.png 848w, https://substackcdn.com/image/fetch/$s_!F1I-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a93f5f4-9c51-4d14-b2a2-281e445723d4_3312x1108.png 1272w, https://substackcdn.com/image/fetch/$s_!F1I-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a93f5f4-9c51-4d14-b2a2-281e445723d4_3312x1108.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Processing</h2><p>Now that we have a streaming source that can read TLE data for multiple satellites, we can build processing use cases around it.</p><p>There are all sorts of processing we could do against this TLE data: anomaly/drift detection, collision prediction, data visualization, etc.</p><p>Let&#8217;s examine a simple processing use case: transforming the data into <a href="https://en.wikipedia.org/wiki/Earth-centered,_Earth-fixed_coordinate_system">Earth-centered, Earth-fixed (ECEF)</a> coordinates and then visualizing it in a 3D plot.</p><h3>Convert TEME to ECEF Coordinates</h3><p>The raw position coordinates we have so far are in the True Equator, Mean Equinox (TEME) frame, which is the frame SGP4 outputs. 
We need to convert these into ECEF coordinates to plot them in relation to the Earth.</p><p>This is a classic use case for data transformations. Let&#8217;s write a helper function, which we can later use as a User-Defined Function (UDF).</p><pre><code>from astropy.time import Time
from astropy.coordinates import TEME, ITRS
from astropy import units

# Convert TEME to ECEF
def teme_to_ecef(rx, ry, rz, jd, fr):
    teme_position = TEME(
        rx * units.km, 
        ry * units.km, 
        rz * units.km, 
        obstime=Time(jd, fr, format="jd")  # pass jd and fr separately to keep full precision
    )
    ecef_position = teme_position.transform_to(ITRS(obstime=teme_position.obstime))
    return ecef_position.x.value, ecef_position.y.value, ecef_position.z.value

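# Optional sketch (not part of the original source): once positions are in ECEF,
# a rough sub-satellite point follows from basic trigonometry. This hypothetical
# helper assumes a spherical Earth with a 6371 km radius, so treat the latitude
# and altitude as approximations rather than true geodetic values.
import math

def ecef_to_lat_lon_alt(x_km, y_km, z_km, earth_radius_km=6371.0):
    r = math.sqrt(x_km**2 + y_km**2 + z_km**2)
    lat_deg = math.degrees(math.asin(z_km / r))
    lon_deg = math.degrees(math.atan2(y_km, x_km))
    # Returns (latitude in degrees, longitude in degrees, altitude in km)
    return lat_deg, lon_deg, r - earth_radius_km
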
</code></pre><h3>Data Visualization</h3><p>Static visualization can be done with <strong>matplotlib</strong>. We will use a 3D scatter plot to represent satellite positions and a wireframe sphere to represent the Earth.</p><p>Let&#8217;s put this in a helper function to be reused:</p><pre><code>from pyspark.sql import DataFrame
import matplotlib.cm as cm

def visualize(df: DataFrame):
    pDF = df.toPandas()
    pDF["color_id"] = pd.factorize(pDF["norad_id"])[0]

    # Generate a color map for unique norad_ids
    unique_ids = pDF["norad_id"].unique()
    id_to_color = {
        norad_id: cm.viridis(i / len(unique_ids))
        for i, norad_id in enumerate(unique_ids)
    }

    # Convert TEME to ECEF
    pDF["x_ecef"], pDF["y_ecef"], pDF["z_ecef"] = zip(*pDF.apply(lambda row: teme_to_ecef(
        row["pos"]["x"], row["pos"]["y"], row["pos"]["z"], row["jd"], row["fr"]
    ), axis=1))

    # Plotting
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.set_xlim([-7000, 7000])
    ax.set_ylim([-7000, 7000])
    ax.set_zlim([-7000, 7000])
    ax.set_xlabel('X (km)')
    ax.set_ylabel('Y (km)')
    ax.set_zlabel('Z (km)')

    # Plot the Earth
    u, v = np.mgrid[0:2*np.pi:100j, 0:np.pi:50j]
    x = 6371 * np.cos(u) * np.sin(v)
    y = 6371 * np.sin(u) * np.sin(v)
    z = 6371 * np.cos(v)
    ax.plot_wireframe(x, y, z, color='blue', alpha=0.1)

    # Plot the satellites with different colors for each one
    for norad_id, g in pDF.groupby('norad_id'):
        ax.scatter(
            g["x_ecef"], 
            g["y_ecef"], 
            g["z_ecef"], 
            color=id_to_color[norad_id],
            s=10,
            label=f"Satellite {norad_id}"
        )

    ax.legend()
    plt.show()
</code></pre><h1>&#128640; Putting It All Together</h1><p>This is where it all comes together! We&#8217;ll stream TLE data for three satellites simultaneously, compute their position and velocity, predict where they will be at each minute over the next 4 hours, convert those positions to the ECEF coordinate system, and finally plot the predictions in a 3D visualization.</p><pre><code>

df = (
    spark
        .readStream
        .format("satellite")
        .option("norad_id", "25544,33499,46362")
        .option("size", "240") # 240 predictions
        .option("timedelta", "minutes=1") # 1-minute predictions. 240 x 1 minute = 240 minutes = 4 hours
        .option("api_key", "DEMO_KEY")
        .load()
)

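# Process the stream; awaitTermination() blocks until the query stops and returns None,
# so the result is not assigned to a variable
(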
    df
        .writeStream
        .outputMode("append")
        .trigger(once=True)
        .foreachBatch(lambda df, batch_id: visualize(df))
        .start()
        .awaitTermination()
)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jt6t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad96f092-418e-47cc-82b9-c7f30aa31fdd_417x398.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jt6t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad96f092-418e-47cc-82b9-c7f30aa31fdd_417x398.png 424w, https://substackcdn.com/image/fetch/$s_!jt6t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad96f092-418e-47cc-82b9-c7f30aa31fdd_417x398.png 848w, https://substackcdn.com/image/fetch/$s_!jt6t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad96f092-418e-47cc-82b9-c7f30aa31fdd_417x398.png 1272w, https://substackcdn.com/image/fetch/$s_!jt6t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad96f092-418e-47cc-82b9-c7f30aa31fdd_417x398.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jt6t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad96f092-418e-47cc-82b9-c7f30aa31fdd_417x398.png" width="671" height="640.4268585131895" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ad96f092-418e-47cc-82b9-c7f30aa31fdd_417x398.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:398,&quot;width&quot;:417,&quot;resizeWidth&quot;:671,&quot;bytes&quot;:147091,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.makewithdata.tech/i/161239932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad96f092-418e-47cc-82b9-c7f30aa31fdd_417x398.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jt6t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad96f092-418e-47cc-82b9-c7f30aa31fdd_417x398.png 424w, https://substackcdn.com/image/fetch/$s_!jt6t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad96f092-418e-47cc-82b9-c7f30aa31fdd_417x398.png 848w, https://substackcdn.com/image/fetch/$s_!jt6t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad96f092-418e-47cc-82b9-c7f30aa31fdd_417x398.png 1272w, https://substackcdn.com/image/fetch/$s_!jt6t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad96f092-418e-47cc-82b9-c7f30aa31fdd_417x398.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Result and Final Thoughts</h1><p>Wow. Obviously, I am not the best at matplotlib, and this data visualization does leave something to be desired; however, it&#8217;s amazing how simple and idiomatic it is to build.</p><h2>&#9888;&#65039; Cautionary Comments</h2><p>Although it&#8217;s quite easy to build your own custom sources now with Spark 4 and newer Databricks Runtimes, I would caution you to think very carefully about the following implementation choices:</p><ul><li><p>What to use for offsets</p><ul><li><p>Offsets are a critical piece of Spark Structured Streaming: they record how far a query has read, so a restarted query can resume where it left off. 
If your upstream source has any attributes that make it durable or replayable (for example sequence numbers, timestamps, or cursors), you should definitely consider using them as offsets in your source.</p></li></ul></li><li><p>How partitions are created</p><ul><li><p>The number of partitions your source generates from the <strong>partitions()</strong> method directly determines how much parallel processing Spark can do with its executors. Avoid having too few partitions if your dataset is very large, and avoid over-partitioning, which can lead to the small-files problem (a large number of tiny output files).</p></li><li><p>Note: alternatively, if you don&#8217;t need partitioning, you can implement the <strong>SimpleDataSourceStreamReader</strong> instead of <strong>DataSourceStreamReader</strong>.</p></li></ul></li><li><p>Avoid accessing state or mutating class members from the <strong>read()</strong> method. It runs on the executors against a serialized copy of the reader, so changes made there never reach the driver. As a general rule of thumb, I try to avoid using the <strong>self</strong> keyword in this method.</p></li></ul><h2>Source Code</h2><p>All of the source code is available on <a href="https://gist.github.com/zcking/9fc46dce43d71a98a7effaefd2b15f4d">GitHub</a>. 
I have also implemented support for <strong>batch</strong> reading in the same source code, so take a look if you are interested in more than just streaming.</p><p></p>]]></content:encoded></item><item><title><![CDATA[Build a MCP Server for AI Access to UniFi Networks (Goose or Claude)]]></title><description><![CDATA[In just a few minutes you can create your own local MCP server and control your UniFi network with AI agents like Goose and Claude.]]></description><link>https://www.makewithdata.tech/p/build-a-mcp-server-for-ai-access</link><guid isPermaLink="false">https://www.makewithdata.tech/p/build-a-mcp-server-for-ai-access</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Sun, 30 Mar 2025 19:30:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!IIPZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F203885f3-8cff-40d3-95f0-57427e83e20e_2904x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>UniFi networking gear is great. 
It also exposes a very basic REST API, which still provides some neat information about our UniFi networks, devices, and connected clients.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IIPZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F203885f3-8cff-40d3-95f0-57427e83e20e_2904x1280.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IIPZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F203885f3-8cff-40d3-95f0-57427e83e20e_2904x1280.png 424w, https://substackcdn.com/image/fetch/$s_!IIPZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F203885f3-8cff-40d3-95f0-57427e83e20e_2904x1280.png 848w, https://substackcdn.com/image/fetch/$s_!IIPZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F203885f3-8cff-40d3-95f0-57427e83e20e_2904x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!IIPZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F203885f3-8cff-40d3-95f0-57427e83e20e_2904x1280.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IIPZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F203885f3-8cff-40d3-95f0-57427e83e20e_2904x1280.png" width="1456" height="642" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/203885f3-8cff-40d3-95f0-57427e83e20e_2904x1280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:642,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:339290,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.makewithdata.tech/i/160201921?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F203885f3-8cff-40d3-95f0-57427e83e20e_2904x1280.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IIPZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F203885f3-8cff-40d3-95f0-57427e83e20e_2904x1280.png 424w, https://substackcdn.com/image/fetch/$s_!IIPZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F203885f3-8cff-40d3-95f0-57427e83e20e_2904x1280.png 848w, https://substackcdn.com/image/fetch/$s_!IIPZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F203885f3-8cff-40d3-95f0-57427e83e20e_2904x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!IIPZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F203885f3-8cff-40d3-95f0-57427e83e20e_2904x1280.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I recently gave my home network a fresh makeover with UniFi hardware and came across this API. But I&#8217;m also a data &amp; AI maker nerd. </p><p>Today, in just a few minutes, we&#8217;ll see how to build our own MCP server implementation that runs locally on your machine and uses the local UniFi API. 
This will let you talk to your network in natural language through modern AI agents like Goose and Claude.</p><h2>Setup / Prerequisites</h2><ol><li><p>Create an API key for accessing the UniFi Network API</p><ol><li><p>Go to your UniFi console at <a href="https://unifi.ui.com">https://unifi.ui.com</a></p></li><li><p>Go to <strong>Settings &#187; Control Plane &#187; Integrations</strong> and click <em>Create API Key</em></p></li></ol></li><li><p>Install <a href="https://docs.astral.sh/uv/getting-started/">uv</a>, which we&#8217;ll use to manage the Python project.</p></li></ol><p></p><h2>Create Project Structure</h2><p>Next, we&#8217;ll initialize the project with a few simple commands:</p><pre><code>uv init mcp-server-unifi
cd mcp-server-unifi
uv venv
uv add "mcp[cli]" httpx requests</code></pre><p>uv will also create a <code>hello.py</code> file, which we don&#8217;t need, so you can delete it. (Newer uv releases may generate <code>main.py</code> instead; in that case, skip the delete and simply overwrite the file in the next step.)</p><pre><code>rm hello.py</code></pre><p></p><h2>Implementing the MCP Server</h2><p>Let&#8217;s begin the implementation by importing the modules we need and defining environment variables for configuration.</p><pre><code># File: main.py

from typing import Any, List, Dict, Optional
import os
from mcp.server.fastmcp import FastMCP
import requests

# Configuration
UNIFI_API_KEY = os.getenv("UNIFI_API_KEY", "YOUR_API_KEY_HERE")
UNIFI_GATEWAY_HOST = os.getenv("UNIFI_GATEWAY_HOST", "192.168.1.1")
UNIFI_GATEWAY_PORT = os.getenv("UNIFI_GATEWAY_PORT", "443")
UNIFI_GATEWAY_BASE_URL = f"https://{UNIFI_GATEWAY_HOST}:{UNIFI_GATEWAY_PORT}/proxy/network/integration"
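
# Optional guard, a hedged sketch that is not part of the original server:
# fail fast when the key is missing or still the placeholder default.
# require_api_key is a hypothetical helper name.
def require_api_key(value, placeholder="YOUR_API_KEY_HERE"):
    """Return the key unchanged, or raise if it was never configured."""
    if not value or value == placeholder:
        raise RuntimeError("Set the UNIFI_API_KEY environment variable first")
    return value

# Example usage: UNIFI_API_KEY = require_api_key(UNIFI_API_KEY)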

# Initialize FastMCP server
mcp = FastMCP("unifi")</code></pre><p>Reading these values from environment variables makes it easy to point the server at your own gateway later, without editing the code.</p><h3>Helper Functions</h3><p>This entire MCP server simply wraps the REST API, so every endpoint would otherwise repeat the same request code. To keep things DRY, let&#8217;s add a helper function for making the API calls.</p><pre><code>def unifi_request(path: str, method: str, params: Optional[Dict[str, Any]] = None, data: Optional[Dict[str, Any]] = None):
    """
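    Make a request to the UniFi Network API

    Args:
        path (str): The path to the API endpoint, e.g. "/v1/sites"
        method (str): The HTTP method to use
        params (Optional[Dict[str, Any]], optional): Query string parameters. Defaults to None.
        data (Optional[Dict[str, Any]], optional): The JSON body to send to the API. Defaults to None.

    Returns:
        dict: The response JSON from the API
    """
    # Callers pass paths with a leading slash, so strip it to avoid "//" in the URL
    url = f"{UNIFI_GATEWAY_BASE_URL}/{path.lstrip('/')}"
    headers = {
        "Content-Type": "application/json",
        "X-API-Key": UNIFI_API_KEY,
    }
    # verify=False skips TLS certificate checks, since a local gateway typically
    # presents a self-signed certificate; only do this on a network you trust
    response = requests.request(method, url, headers=headers, params=params, json=data, verify=False, timeout=30)
    response.raise_for_status()  # Surface HTTP errors instead of parsing an error body
    return response.json()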
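

# Aside, as a minimal standalone sketch: the resources below pass paths with a
# leading slash (for example "/v1/sites"), and joining a base URL and such a
# path naively yields "//" in the final URL. join_url is a hypothetical helper
# that shows the normalization in isolation; whether a gateway tolerates the
# double slash is not something this post verifies.
def join_url(base_url, path):
    """Join a base URL and an endpoint path with exactly one slash."""
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}"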
</code></pre><h3>Defining Resources</h3><p>Now, the fun part! We are going to define some <a href="https://modelcontextprotocol.io/docs/concepts/resources">resources</a> for our server. In MCP, resources allow the server to expose data and content that can be read by the clients (like Claude) and used as context for LLMs.</p><p>Each resource has a URI that follows this format:</p><pre><code>[protocol]://[host]/[path]</code></pre><p>Using the UniFi Network API, we can read data about the UniFi sites, which are at the top of our UniFi hierarchy. Within a site, we can also read data about adopted devices, connected clients, etc. </p><p>So, for the following endpoint, we could come up with a resource and URI like <code>sites://{site_id}/devices</code></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eZBY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa18982-bd50-401f-9cb0-e1916215f1a2_2316x1166.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eZBY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa18982-bd50-401f-9cb0-e1916215f1a2_2316x1166.png 424w, https://substackcdn.com/image/fetch/$s_!eZBY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa18982-bd50-401f-9cb0-e1916215f1a2_2316x1166.png 848w, https://substackcdn.com/image/fetch/$s_!eZBY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa18982-bd50-401f-9cb0-e1916215f1a2_2316x1166.png 1272w, 
https://substackcdn.com/image/fetch/$s_!eZBY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa18982-bd50-401f-9cb0-e1916215f1a2_2316x1166.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eZBY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa18982-bd50-401f-9cb0-e1916215f1a2_2316x1166.png" width="1456" height="733" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aaa18982-bd50-401f-9cb0-e1916215f1a2_2316x1166.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:733,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:232112,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.makewithdata.tech/i/160201921?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa18982-bd50-401f-9cb0-e1916215f1a2_2316x1166.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eZBY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa18982-bd50-401f-9cb0-e1916215f1a2_2316x1166.png 424w, https://substackcdn.com/image/fetch/$s_!eZBY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa18982-bd50-401f-9cb0-e1916215f1a2_2316x1166.png 848w, 
https://substackcdn.com/image/fetch/$s_!eZBY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa18982-bd50-401f-9cb0-e1916215f1a2_2316x1166.png 1272w, https://substackcdn.com/image/fetch/$s_!eZBY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa18982-bd50-401f-9cb0-e1916215f1a2_2316x1166.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To implement resources in our Python code, we&#8217;ll make use of the server&#8217;s decorators like <code>@mcp.tool()</code> and 
<code>@mcp.resource()</code>.</p><pre><code>
@mcp.resource("sites://")
async def list_sites() -&gt; List[Dict[str, Any]]:
    """List all sites in the Unifi controller"""
    sites = []
    params = {"limit": 200, "offset": 0}

    while True:
        resp = unifi_request("/v1/sites", "GET", params=params)
        sites.extend(resp["data"])
        if resp["count"] != resp["limit"] or resp["totalCount"] &lt;= len(sites):
            break
        params["offset"] += resp["limit"]

    return sites
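

# The offset/limit loop above, isolated as a minimal sketch so it can be
# exercised without a gateway. paginate and fake_fetch are hypothetical names;
# the response keys (data, count, limit, totalCount) are the ones used above.
def paginate(fetch_page, limit=200):
    """Collect every item from an offset/limit paged endpoint."""
    items = []
    offset = 0
    while True:
        resp = fetch_page(offset, limit)
        items.extend(resp["data"])
        remaining = resp["totalCount"] - len(items)
        # Stop on a short page, or once nothing is left to fetch
        if resp["count"] != resp["limit"] or max(remaining, 0) == 0:
            break
        offset += resp["limit"]
    return items


def fake_fetch(offset, limit, total=5):
    """Stand-in for unifi_request: serves the integers 0..total-1 in pages."""
    data = list(range(total))[offset:offset + limit]
    return {"data": data, "count": len(data), "limit": limit, "totalCount": total}


# Three requests (2 + 2 + 1 items) are enough to collect all five items
assert paginate(fake_fetch, limit=2) == [0, 1, 2, 3, 4]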


@mcp.resource("sites://{site_id}/devices")
async def list_devices(site_id: str) -&gt; List[Dict[str, Any]]:
    """
    List all devices in a specific Unifi site
    
    Args:
        site_id (str): The ID of the site to list devices for
        
    Returns:
        List[Dict[str, Any]]: List of devices in the site
    """
    devices = []
    params = {"limit": 200, "offset": 0}  # site_id is already part of the URL path
    
    while True:
        resp = unifi_request(f"/v1/sites/{site_id}/devices", "GET", params=params)
        devices.extend(resp["data"])
        if resp["count"] != resp["limit"] or resp["totalCount"] &lt;= len(devices):
            break
        params["offset"] += resp["limit"]
    
    return devices</code></pre><h2>Running the MCP Dev Server</h2><p>Finally, we&#8217;re ready to run our MCP server! We just need an entrypoint in the code, so don&#8217;t forget to add:</p><pre><code>if __name__ == "__main__":
    # Initialize and run the server
    mcp.run(transport='stdio')
</code></pre><p>You can run your MCP server standalone just to test that it starts up successfully, with <code>uv run mcp dev main.py</code>. This will also start the <a href="https://modelcontextprotocol.io/docs/tools/inspector">MCP Inspector</a> at <a href="http://localhost:5173">http://localhost:5173</a> which is great for testing and debugging.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ye50!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b237be-5c3f-46f5-b55a-091a40cdfd5d_2622x2176.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ye50!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b237be-5c3f-46f5-b55a-091a40cdfd5d_2622x2176.png 424w, https://substackcdn.com/image/fetch/$s_!Ye50!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b237be-5c3f-46f5-b55a-091a40cdfd5d_2622x2176.png 848w, https://substackcdn.com/image/fetch/$s_!Ye50!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b237be-5c3f-46f5-b55a-091a40cdfd5d_2622x2176.png 1272w, https://substackcdn.com/image/fetch/$s_!Ye50!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b237be-5c3f-46f5-b55a-091a40cdfd5d_2622x2176.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ye50!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b237be-5c3f-46f5-b55a-091a40cdfd5d_2622x2176.png" width="1456" height="1208" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65b237be-5c3f-46f5-b55a-091a40cdfd5d_2622x2176.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1208,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:960114,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.makewithdata.tech/i/160201921?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b237be-5c3f-46f5-b55a-091a40cdfd5d_2622x2176.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ye50!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b237be-5c3f-46f5-b55a-091a40cdfd5d_2622x2176.png 424w, https://substackcdn.com/image/fetch/$s_!Ye50!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b237be-5c3f-46f5-b55a-091a40cdfd5d_2622x2176.png 848w, https://substackcdn.com/image/fetch/$s_!Ye50!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b237be-5c3f-46f5-b55a-091a40cdfd5d_2622x2176.png 1272w, https://substackcdn.com/image/fetch/$s_!Ye50!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b237be-5c3f-46f5-b55a-091a40cdfd5d_2622x2176.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Configuring for Goose</h2><p>If you&#8217;d like to use Block&#8217;s <a href="https://block.github.io/goose/">Goose</a> AI Agent, open Goose and go to <strong>Settings &#187; Extensions &#187; Add custom extension</strong>.</p><p>In the form that prompts you, fill in the following details, and be sure to change &#8220;<em>username&#8221;</em> to your own username as used by your machine:</p><ul><li><p><strong>ID</strong>: unifi</p></li><li><p><strong>Name</strong>:<strong> </strong>unifi</p></li><li><p><strong>Description</strong>: Get information about your UniFi network</p></li><li><p><strong>Command</strong>: </p><ul><li><p>/Users/username/.local/bin/uv --directory /Users/username/path/to/mcp-server-unifi run main.py</p></li></ul></li><li><p><strong>Environment Variables:</strong></p><ul><li><p>UNIFI_API_KEY: input your API 
key</p></li></ul></li></ul><p>Click <em>Add</em> to save the changes and you&#8217;ll have successfully added the MCP server to Goose for local use!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Znja!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec05cf-6ec7-4721-a671-1629ec79a272_1724x2254.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Znja!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec05cf-6ec7-4721-a671-1629ec79a272_1724x2254.png 424w, https://substackcdn.com/image/fetch/$s_!Znja!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec05cf-6ec7-4721-a671-1629ec79a272_1724x2254.png 848w, https://substackcdn.com/image/fetch/$s_!Znja!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec05cf-6ec7-4721-a671-1629ec79a272_1724x2254.png 1272w, https://substackcdn.com/image/fetch/$s_!Znja!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec05cf-6ec7-4721-a671-1629ec79a272_1724x2254.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Znja!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec05cf-6ec7-4721-a671-1629ec79a272_1724x2254.png" width="1456" height="1904" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00ec05cf-6ec7-4721-a671-1629ec79a272_1724x2254.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1904,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:829432,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.makewithdata.tech/i/160201921?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec05cf-6ec7-4721-a671-1629ec79a272_1724x2254.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Znja!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec05cf-6ec7-4721-a671-1629ec79a272_1724x2254.png 424w, https://substackcdn.com/image/fetch/$s_!Znja!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec05cf-6ec7-4721-a671-1629ec79a272_1724x2254.png 848w, https://substackcdn.com/image/fetch/$s_!Znja!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec05cf-6ec7-4721-a671-1629ec79a272_1724x2254.png 1272w, https://substackcdn.com/image/fetch/$s_!Znja!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec05cf-6ec7-4721-a671-1629ec79a272_1724x2254.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>No need to restart Goose, go ahead and try it out:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2-e4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b364621-4dd1-46ff-a825-a522f6396a28_1724x2254.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2-e4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b364621-4dd1-46ff-a825-a522f6396a28_1724x2254.png 424w, 
https://substackcdn.com/image/fetch/$s_!2-e4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b364621-4dd1-46ff-a825-a522f6396a28_1724x2254.png 848w, https://substackcdn.com/image/fetch/$s_!2-e4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b364621-4dd1-46ff-a825-a522f6396a28_1724x2254.png 1272w, https://substackcdn.com/image/fetch/$s_!2-e4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b364621-4dd1-46ff-a825-a522f6396a28_1724x2254.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2-e4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b364621-4dd1-46ff-a825-a522f6396a28_1724x2254.png" width="1456" height="1904" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b364621-4dd1-46ff-a825-a522f6396a28_1724x2254.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1904,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:610642,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.makewithdata.tech/i/160201921?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b364621-4dd1-46ff-a825-a522f6396a28_1724x2254.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!2-e4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b364621-4dd1-46ff-a825-a522f6396a28_1724x2254.png 424w, https://substackcdn.com/image/fetch/$s_!2-e4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b364621-4dd1-46ff-a825-a522f6396a28_1724x2254.png 848w, https://substackcdn.com/image/fetch/$s_!2-e4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b364621-4dd1-46ff-a825-a522f6396a28_1724x2254.png 1272w, https://substackcdn.com/image/fetch/$s_!2-e4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b364621-4dd1-46ff-a825-a522f6396a28_1724x2254.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MN1D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8294337c-49ca-426d-a9a3-80fb30957ce3_1724x2254.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MN1D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8294337c-49ca-426d-a9a3-80fb30957ce3_1724x2254.png 424w, https://substackcdn.com/image/fetch/$s_!MN1D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8294337c-49ca-426d-a9a3-80fb30957ce3_1724x2254.png 848w, https://substackcdn.com/image/fetch/$s_!MN1D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8294337c-49ca-426d-a9a3-80fb30957ce3_1724x2254.png 1272w, https://substackcdn.com/image/fetch/$s_!MN1D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8294337c-49ca-426d-a9a3-80fb30957ce3_1724x2254.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MN1D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8294337c-49ca-426d-a9a3-80fb30957ce3_1724x2254.png" width="1456" height="1904" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8294337c-49ca-426d-a9a3-80fb30957ce3_1724x2254.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1904,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:629043,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.makewithdata.tech/i/160201921?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8294337c-49ca-426d-a9a3-80fb30957ce3_1724x2254.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MN1D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8294337c-49ca-426d-a9a3-80fb30957ce3_1724x2254.png 424w, https://substackcdn.com/image/fetch/$s_!MN1D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8294337c-49ca-426d-a9a3-80fb30957ce3_1724x2254.png 848w, https://substackcdn.com/image/fetch/$s_!MN1D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8294337c-49ca-426d-a9a3-80fb30957ce3_1724x2254.png 1272w, https://substackcdn.com/image/fetch/$s_!MN1D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8294337c-49ca-426d-a9a3-80fb30957ce3_1724x2254.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h2>Configuring for Claude Desktop</h2><p>If you&#8217;d like to use your MCP server with the Claude Desktop app, open Claude and go to <strong>Settings &#187; Developer &#187; Edit Config</strong>.</p><p>This will open the <strong>claude_desktop_config.json</strong> file in a text editor. If it opens your file explorer / Finder application instead, just open the file it has selected automatically.</p><p>Paste the following into this configuration file. Be sure to change &#8220;<em>username</em>&#8221; to the username on your machine, and replace <em>path/to</em> with the location of your local copy of <em>mcp-server-unifi</em>.</p><pre><code>{
    "mcpServers": {
        "unifi": {
            "command": "/Users/username/.local/bin/uv",
            "args": [
                "--directory",
                "/Users/username/path/to/mcp-server-unifi",
                "run",
                "main.py"
            ]
        }
    }
}</code></pre><p>Save your changes to the file, then restart the Claude Desktop application and give it a try!</p><h2>Conclusion</h2><p>Hopefully, you&#8217;ve enjoyed yet another MCP blog. Yes, MCP is very popular right now. While it may not be as &#8220;game-changing&#8221; as tech influencers will claim, it&#8217;s certainly very easy to use and build upon&#8212;and standards are almost always a good thing!</p><p>The source code for this is available on GitHub (although not accepting PRs as it is a demo only): <a href="https://github.com/zcking/mcp-server-unifi">https://github.com/zcking/mcp-server-unifi</a>.</p><p>I want to hear from you though&#8230;what do you think about MCP? Can you think of anything you&#8217;d like an MCP server for but doesn&#8217;t exist yet? Tip: check out <a href="https://github.com/modelcontextprotocol/servers">this growing awesome-list</a> of MCP servers and subscribe to MakeWithData for more!</p><p></p>]]></content:encoded></item><item><title><![CDATA[Is Haystack better than LangChain?]]></title><description><![CDATA[A battle of LLM frameworks, which is better for your next RAG pipeline? 
Can we figure it out with the help of Chuck Norris?]]></description><link>https://www.makewithdata.tech/p/is-haystack-better-than-langchain</link><guid isPermaLink="false">https://www.makewithdata.tech/p/is-haystack-better-than-langchain</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Wed, 08 Jan 2025 15:02:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gYGB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501fd41e-bc85-42e8-b491-fbe1486001ba_1024x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gYGB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501fd41e-bc85-42e8-b491-fbe1486001ba_1024x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gYGB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501fd41e-bc85-42e8-b491-fbe1486001ba_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!gYGB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501fd41e-bc85-42e8-b491-fbe1486001ba_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!gYGB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501fd41e-bc85-42e8-b491-fbe1486001ba_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!gYGB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501fd41e-bc85-42e8-b491-fbe1486001ba_1024x1024.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!gYGB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501fd41e-bc85-42e8-b491-fbe1486001ba_1024x1024.jpeg" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/501fd41e-bc85-42e8-b491-fbe1486001ba_1024x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:270343,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gYGB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501fd41e-bc85-42e8-b491-fbe1486001ba_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!gYGB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501fd41e-bc85-42e8-b491-fbe1486001ba_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!gYGB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501fd41e-bc85-42e8-b491-fbe1486001ba_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!gYGB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501fd41e-bc85-42e8-b491-fbe1486001ba_1024x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>As the adoption of large language models (LLMs) like GPT-4, o1, Llama, and Gemini increases, developers are turning to frameworks and higher-level abstractions to simplify building Gen AI applications powered by these models.</p><p>Unless you were asleep for the past two years, you&#8217;ve probably at least heard of <a href="https://langchain.com/">LangChain</a>, a modular framework for building applications with LLMs, including RAG, chatbots, agentic workflows, and more. 
LangChain has grown extremely popular, with nearly <a href="https://github.com/langchain-ai/langchain">100k stars on GitHub</a> at the time of this writing &#129321;</p><p>Today, though, I want to discuss another open-source framework,&nbsp;<a href="https://haystack.deepset.ai/">Haystack</a>, developed by a Berlin company,&nbsp;<a href="https://www.deepset.ai/">deepset</a>.</p><div><hr></div><h2>First Thoughts</h2><p>My first opinion of Haystack was that it is extremely simple. Now remember that statement because, in a moment, we&#8217;ll see how that&nbsp;<em>could</em>&nbsp;be Haystack&#8217;s kryptonite, too. </p><p>For context, I have been writing Python code for around a decade, so I had no syntax struggles with either framework. However, the semantics in Haystack had a much smaller learning curve, as everything boils down to either a&nbsp;<strong>component</strong>&nbsp;or a&nbsp;<strong>pipeline</strong>. This is immediately pointed out in their&nbsp;<a href="https://haystack.deepset.ai/overview/intro">overview</a>.</p><p>After a couple of hours reading the &#8220;Getting Started&#8221; docs and following a simple RAG example, I successfully implemented my own custom component to retrieve data from a REST API and integrate it with a prompt builder and LLM. Creating custom components was very Pythonic and honestly made sense even if I weren&#8217;t messing with LLMs.</p><p>That said, as I began trying to make more <em>dynamic</em> patterns like an agentic workflow or multi-modal applications, it felt more limiting&#8212;the pipelines were easy to adapt to linear workflows, but more organic, complex flows didn&#8217;t feel as intuitive. 
This may be just my lack of experience.</p><h2>Comparing Customization</h2><p>Pictures speak 1,000 words; code speaks 10,000 &#128517; Let&#8217;s see how they compare by writing a simple custom retriever in LangChain and then in Haystack:</p><h3>LangChain Custom Retriever</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qYSO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50127dbe-5949-4640-be47-b0a772abc982_1013x574.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qYSO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50127dbe-5949-4640-be47-b0a772abc982_1013x574.png 424w, https://substackcdn.com/image/fetch/$s_!qYSO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50127dbe-5949-4640-be47-b0a772abc982_1013x574.png 848w, https://substackcdn.com/image/fetch/$s_!qYSO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50127dbe-5949-4640-be47-b0a772abc982_1013x574.png 1272w, https://substackcdn.com/image/fetch/$s_!qYSO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50127dbe-5949-4640-be47-b0a772abc982_1013x574.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qYSO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50127dbe-5949-4640-be47-b0a772abc982_1013x574.png" width="728" height="412.50937808489635" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50127dbe-5949-4640-be47-b0a772abc982_1013x574.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a429e23e-62c2-4a63-b269-97ed8974512a_1013x574.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:574,&quot;width&quot;:1013,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:115807,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qYSO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50127dbe-5949-4640-be47-b0a772abc982_1013x574.png 424w, https://substackcdn.com/image/fetch/$s_!qYSO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50127dbe-5949-4640-be47-b0a772abc982_1013x574.png 848w, https://substackcdn.com/image/fetch/$s_!qYSO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50127dbe-5949-4640-be47-b0a772abc982_1013x574.png 1272w, https://substackcdn.com/image/fetch/$s_!qYSO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50127dbe-5949-4640-be47-b0a772abc982_1013x574.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>This is pretty straightforward, although we&#8217;re clearly just hardcoding fake documents and doing one of the most naive and inefficient searches possible.</p><p>With LangChain, we just inherit from the <strong>BaseRetriever</strong> abstract class and implement the <code>_get_relevant_documents()</code> method. The class uses <a href="https://docs.pydantic.dev/latest/">Pydantic</a> to model its required inputs, of which we define only one: <code>documents</code>. 
A more realistic retriever might implement an embedding algorithm, query a vector database, call a REST API, or even use an LLM (e.g., generate relevant questions based on the user&#8217;s query).</p><h3>Haystack Custom Retriever</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8FUN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcaa677e-7b9f-4be1-86bf-8a9408e4ad2c_762x681.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8FUN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcaa677e-7b9f-4be1-86bf-8a9408e4ad2c_762x681.png 424w, https://substackcdn.com/image/fetch/$s_!8FUN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcaa677e-7b9f-4be1-86bf-8a9408e4ad2c_762x681.png 848w, https://substackcdn.com/image/fetch/$s_!8FUN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcaa677e-7b9f-4be1-86bf-8a9408e4ad2c_762x681.png 1272w, https://substackcdn.com/image/fetch/$s_!8FUN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcaa677e-7b9f-4be1-86bf-8a9408e4ad2c_762x681.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8FUN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcaa677e-7b9f-4be1-86bf-8a9408e4ad2c_762x681.png" width="762" height="681" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bcaa677e-7b9f-4be1-86bf-8a9408e4ad2c_762x681.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:681,&quot;width&quot;:762,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:123523,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8FUN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcaa677e-7b9f-4be1-86bf-8a9408e4ad2c_762x681.png 424w, https://substackcdn.com/image/fetch/$s_!8FUN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcaa677e-7b9f-4be1-86bf-8a9408e4ad2c_762x681.png 848w, https://substackcdn.com/image/fetch/$s_!8FUN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcaa677e-7b9f-4be1-86bf-8a9408e4ad2c_762x681.png 1272w, https://substackcdn.com/image/fetch/$s_!8FUN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcaa677e-7b9f-4be1-86bf-8a9408e4ad2c_762x681.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Haystack looks pretty similar, but I took the liberty of using the <strong>InMemoryDocumentStore</strong>, which supports <a href="https://en.wikipedia.org/wiki/Okapi_BM25">BM25</a>, a probabilistic ranking function that scores documents using bag-of-words term frequencies.</p><p>With Haystack, the component is a Python class marked with the <code>@component</code> decorator that defines a <code>run()</code> method.</p><p>These retrievers are not usually invoked directly; they are often part of a chain or pipeline. Let&#8217;s see another example that uses a component in a pipeline: custom prompts.</p><h3>Haystack Pipelines</h3><p>Okay, let&#8217;s create a more&#8230;fun, custom component. 
This calls the Chuck Norris joke API and returns the joke that we will then use in a pipeline to generate images &#129315;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HusZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d0a6aa-567f-43ba-9bdc-4e9251347ae1_603x597.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HusZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d0a6aa-567f-43ba-9bdc-4e9251347ae1_603x597.png 424w, https://substackcdn.com/image/fetch/$s_!HusZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d0a6aa-567f-43ba-9bdc-4e9251347ae1_603x597.png 848w, https://substackcdn.com/image/fetch/$s_!HusZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d0a6aa-567f-43ba-9bdc-4e9251347ae1_603x597.png 1272w, https://substackcdn.com/image/fetch/$s_!HusZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d0a6aa-567f-43ba-9bdc-4e9251347ae1_603x597.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HusZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d0a6aa-567f-43ba-9bdc-4e9251347ae1_603x597.png" width="709" height="701.9452736318408" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/67d0a6aa-567f-43ba-9bdc-4e9251347ae1_603x597.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:597,&quot;width&quot;:603,&quot;resizeWidth&quot;:709,&quot;bytes&quot;:92965,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HusZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d0a6aa-567f-43ba-9bdc-4e9251347ae1_603x597.png 424w, https://substackcdn.com/image/fetch/$s_!HusZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d0a6aa-567f-43ba-9bdc-4e9251347ae1_603x597.png 848w, https://substackcdn.com/image/fetch/$s_!HusZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d0a6aa-567f-43ba-9bdc-4e9251347ae1_603x597.png 1272w, https://substackcdn.com/image/fetch/$s_!HusZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d0a6aa-567f-43ba-9bdc-4e9251347ae1_603x597.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Next, you create a <strong>Pipeline</strong>, add the components, and connect them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vLtW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37307411-6cb2-4b1e-ba7c-91abf2fe66d1_662x571.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vLtW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37307411-6cb2-4b1e-ba7c-91abf2fe66d1_662x571.png 424w, https://substackcdn.com/image/fetch/$s_!vLtW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37307411-6cb2-4b1e-ba7c-91abf2fe66d1_662x571.png 848w, 
https://substackcdn.com/image/fetch/$s_!vLtW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37307411-6cb2-4b1e-ba7c-91abf2fe66d1_662x571.png 1272w, https://substackcdn.com/image/fetch/$s_!vLtW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37307411-6cb2-4b1e-ba7c-91abf2fe66d1_662x571.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vLtW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37307411-6cb2-4b1e-ba7c-91abf2fe66d1_662x571.png" width="718" height="619.3021148036254" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37307411-6cb2-4b1e-ba7c-91abf2fe66d1_662x571.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:571,&quot;width&quot;:662,&quot;resizeWidth&quot;:718,&quot;bytes&quot;:94435,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vLtW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37307411-6cb2-4b1e-ba7c-91abf2fe66d1_662x571.png 424w, https://substackcdn.com/image/fetch/$s_!vLtW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37307411-6cb2-4b1e-ba7c-91abf2fe66d1_662x571.png 848w, 
https://substackcdn.com/image/fetch/$s_!vLtW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37307411-6cb2-4b1e-ba7c-91abf2fe66d1_662x571.png 1272w, https://substackcdn.com/image/fetch/$s_!vLtW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37307411-6cb2-4b1e-ba7c-91abf2fe66d1_662x571.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As you can see, this is very straightforward with Haystack. 
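</p><p>To make the wiring idea concrete, here is a minimal plain-Python sketch of the same pattern: named components, connections from one component's output to another's input, and a run step. It is a toy stand-in, not Haystack's implementation; <code>TinyPipeline</code>, <code>joke_api</code>, and <code>prompt_builder</code> are hypothetical names, and the joke text is a placeholder rather than a real API response.</p>
<pre>
```python
from typing import Any, Callable, Dict, List, Tuple


class TinyPipeline:
    """Toy stand-in for a pipeline: named components wired output-to-input."""

    def __init__(self) -> None:
        self.components: Dict[str, Callable[..., Dict[str, Any]]] = {}
        self.connections: List[Tuple[str, str, str, str]] = []

    def add_component(self, name: str, fn: Callable[..., Dict[str, Any]]) -> None:
        # Register a component under a unique name.
        self.components[name] = fn

    def connect(self, sender: str, receiver: str) -> None:
        # Both arguments use the form "component_name.socket_name".
        src, out_key = sender.split(".")
        dst, in_key = receiver.split(".")
        self.connections.append((src, out_key, dst, in_key))

    def run(self, data: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
        # Components run in registration order here (assumption for brevity);
        # a real pipeline works out the order from the connection graph.
        inputs = {name: dict(data.get(name, {})) for name in self.components}
        results: Dict[str, Dict[str, Any]] = {}
        for name, fn in self.components.items():
            results[name] = fn(**inputs[name])
            # Forward this component's outputs to every connected input.
            for src, out_key, dst, in_key in self.connections:
                if src == name:
                    inputs[dst][in_key] = results[name][out_key]
        return results


def joke_api() -> Dict[str, str]:
    # Hypothetical stand-in for the joke API call; returns placeholder text.
    return {"joke": "placeholder joke text"}


def prompt_builder(joke: str, style: str) -> Dict[str, str]:
    # Fills a template with the joke and the requested image style.
    return {"prompt": f"Create a {style} image of this joke: {joke}"}


pipe = TinyPipeline()
pipe.add_component("joke_api", joke_api)
pipe.add_component("prompt", prompt_builder)
pipe.connect("joke_api.joke", "prompt.joke")

# "style" is not connected to anything, so it is supplied at run time.
result = pipe.run({"prompt": {"style": "comic book"}})
print(result["prompt"]["prompt"])
```
</pre>
<p>Because nothing is connected to <code>style</code>, it has to be supplied when the pipeline runs, which is what lets you swap styles between runs.</p><p>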
You register each component with a name and then wire them together, essentially connecting each component&#8217;s inputs and outputs.</p><p>In this example, I am pipelining the Chuck Norris API &#8594; PromptBuilder &#8594; DALL-E image generator.</p><p>Notice that I don&#8217;t connect anything in the pipeline to the&nbsp;<strong>style</strong>&nbsp;template variable; instead, I leave it empty to pass that in via the command-line arguments and change the style as I like.</p><p>Running this pipeline gave me some great laughs, and even if you hate Haystack, I recommend giving it a shot for your own fun.</p><blockquote><p><strong>Joke: </strong><em>When Chuck Norris break the build, you can't fix it, because there is not a single line of code left</em><strong>.</strong></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bifn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54a723f-df50-4734-9371-0dc1c9972532_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bifn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54a723f-df50-4734-9371-0dc1c9972532_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!bifn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54a723f-df50-4734-9371-0dc1c9972532_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!bifn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54a723f-df50-4734-9371-0dc1c9972532_1024x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!bifn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54a723f-df50-4734-9371-0dc1c9972532_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bifn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54a723f-df50-4734-9371-0dc1c9972532_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b54a723f-df50-4734-9371-0dc1c9972532_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bifn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54a723f-df50-4734-9371-0dc1c9972532_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!bifn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54a723f-df50-4734-9371-0dc1c9972532_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!bifn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54a723f-df50-4734-9371-0dc1c9972532_1024x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!bifn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54a723f-df50-4734-9371-0dc1c9972532_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>LangChain Chains</h3><p>If I were to write this in LangChain, the simplest way would be to write custom tools with the <strong>@tool</strong> decorator, or you could use classes and subclass them from the <strong>BaseTool</strong> class provided by LangChain. 
Afterward, we would construct a chain, such as the <strong><a href="https://python.langchain.com/api_reference/langchain/chains/langchain.chains.sequential.SequentialChain.html">SequentialChain</a></strong>, and provide it with a list of our tools.</p><p>I found creating a PromptTemplate in LangChain and using it between the joke generator and the image generator strangely verbose; however, this makes sense given that LangChain is designed to be more open-ended and caters well to building agents and agentic workflows.</p><h2>Final Comparison</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6SrU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81085ce2-7b25-48b1-8836-a04529ed67a8_602x506.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6SrU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81085ce2-7b25-48b1-8836-a04529ed67a8_602x506.png 424w, https://substackcdn.com/image/fetch/$s_!6SrU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81085ce2-7b25-48b1-8836-a04529ed67a8_602x506.png 848w, https://substackcdn.com/image/fetch/$s_!6SrU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81085ce2-7b25-48b1-8836-a04529ed67a8_602x506.png 1272w, https://substackcdn.com/image/fetch/$s_!6SrU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81085ce2-7b25-48b1-8836-a04529ed67a8_602x506.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6SrU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81085ce2-7b25-48b1-8836-a04529ed67a8_602x506.png" width="602" height="506" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/81085ce2-7b25-48b1-8836-a04529ed67a8_602x506.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:506,&quot;width&quot;:602,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6SrU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81085ce2-7b25-48b1-8836-a04529ed67a8_602x506.png 424w, https://substackcdn.com/image/fetch/$s_!6SrU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81085ce2-7b25-48b1-8836-a04529ed67a8_602x506.png 848w, https://substackcdn.com/image/fetch/$s_!6SrU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81085ce2-7b25-48b1-8836-a04529ed67a8_602x506.png 1272w, https://substackcdn.com/image/fetch/$s_!6SrU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81085ce2-7b25-48b1-8836-a04529ed67a8_602x506.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Both are very similar, but I found Haystack simpler, with a gentler learning curve. Haystack is a great fit if you just want to get up and running fast, building applications such as RAG, Q&amp;A, search, or general sequential workflows.</p><p>LangChain, on the other hand, has a steeper learning curve but is more robust. LangChain is better for agentic workflows and advanced concepts such as tool calling, ReAct (Reasoning and Acting), and chat memory. 
LangChain probably has more community tools available now, although Haystack also has <a href="https://haystack.deepset.ai/integrations">tons</a>.</p><p>At the end of the day, you shouldn&#8217;t listen to the hype about specific tools; try Haystack and LangChain out and see which one fits your needs and development style better!</p><blockquote><p>&#128680; Hey! I&#8217;m currently working on a personal health and fitness Haystack application that uses data from apps like <strong>MyFitnessPal</strong>, <strong>Hevy</strong>, and leading LLMs to give personal feedback and recommendations. If you&#8217;d like to see a project-based post for this in the future, be sure to subscribe!</p></blockquote>]]></content:encoded></item><item><title><![CDATA[7 Ways to use AI for Data Products in 2025]]></title><description><![CDATA[AI isn't going anywhere! This year, learn how to stay ahead of the competition and streamline business by integrating AI into your data products.]]></description><link>https://www.makewithdata.tech/p/7-ways-to-use-ai-for-data-products</link><guid isPermaLink="false">https://www.makewithdata.tech/p/7-ways-to-use-ai-for-data-products</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Mon, 06 Jan 2025 13:01:59 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/80254b45-5799-4ef8-abcd-de28929f2da3_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If 2024 has taught us anything, it&#8217;s that AI (generative or not) finds a home in everything we touch and use. 
Data is the root of AI's power, so, of course, data products are among those being enriched.</p><p>That said, many of the trends and startups were pure hype, and you either drank the Kool-Aid or played it slow at the risk of &#8220;falling behind&#8221; (seemingly, at least). The difference in 2025 is that generative AI tools are now in the &#8220;trough of disillusionment.&#8221; This means we can expect to see these tools mature and some of the more gimmicky offerings that are just out for a quick buck to die off.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3Wf4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b27db9-7350-4bac-bb28-3f983b1d2af9_800x600.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3Wf4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b27db9-7350-4bac-bb28-3f983b1d2af9_800x600.jpeg 424w, https://substackcdn.com/image/fetch/$s_!3Wf4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b27db9-7350-4bac-bb28-3f983b1d2af9_800x600.jpeg 848w, https://substackcdn.com/image/fetch/$s_!3Wf4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b27db9-7350-4bac-bb28-3f983b1d2af9_800x600.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!3Wf4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b27db9-7350-4bac-bb28-3f983b1d2af9_800x600.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!3Wf4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b27db9-7350-4bac-bb28-3f983b1d2af9_800x600.jpeg" width="800" height="600" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/35b27db9-7350-4bac-bb28-3f983b1d2af9_800x600.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Trough Of Disillusionment And Four Outliers On The ...&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Trough Of Disillusionment And Four Outliers On The ..." title="The Trough Of Disillusionment And Four Outliers On The ..." 
srcset="https://substackcdn.com/image/fetch/$s_!3Wf4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b27db9-7350-4bac-bb28-3f983b1d2af9_800x600.jpeg 424w, https://substackcdn.com/image/fetch/$s_!3Wf4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b27db9-7350-4bac-bb28-3f983b1d2af9_800x600.jpeg 848w, https://substackcdn.com/image/fetch/$s_!3Wf4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b27db9-7350-4bac-bb28-3f983b1d2af9_800x600.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!3Wf4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b27db9-7350-4bac-bb28-3f983b1d2af9_800x600.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Gartner Hype Cycle. Source: https://www.forbes.com/sites/johnwerner/2024/07/18/the-trough-of-disillusionment-and-four-outliers-on-the-gartner-hype-cycle/</figcaption></figure></div><p>So, as data engineers and stewards of our organization&#8217;s data, we&nbsp;<strong>must</strong>&nbsp;prepare for the pragmatic integration of AI into our data products this upcoming year. I&#8217;ll share several use cases in categories such as:</p><ul><li><p>Developer Experience</p></li><li><p>Data Processing</p></li><li><p>Data Consumption</p></li><li><p>New Products</p></li></ul><h1>Predictive Modeling and Forecasting</h1><p>Data engineers can integrate AI models into data pipelines to generate predictions and forecasts. 
For example, we may predict customer churn, forecast sales or website trends, and identify anomalies.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wa-A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a6576b3-08bb-4298-942f-dbb452619dae_686x590.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wa-A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a6576b3-08bb-4298-942f-dbb452619dae_686x590.png 424w, https://substackcdn.com/image/fetch/$s_!wa-A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a6576b3-08bb-4298-942f-dbb452619dae_686x590.png 848w, https://substackcdn.com/image/fetch/$s_!wa-A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a6576b3-08bb-4298-942f-dbb452619dae_686x590.png 1272w, https://substackcdn.com/image/fetch/$s_!wa-A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a6576b3-08bb-4298-942f-dbb452619dae_686x590.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wa-A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a6576b3-08bb-4298-942f-dbb452619dae_686x590.png" width="686" height="590" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a6576b3-08bb-4298-942f-dbb452619dae_686x590.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:590,&quot;width&quot;:686,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wa-A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a6576b3-08bb-4298-942f-dbb452619dae_686x590.png 424w, https://substackcdn.com/image/fetch/$s_!wa-A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a6576b3-08bb-4298-942f-dbb452619dae_686x590.png 848w, https://substackcdn.com/image/fetch/$s_!wa-A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a6576b3-08bb-4298-942f-dbb452619dae_686x590.png 1272w, https://substackcdn.com/image/fetch/$s_!wa-A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a6576b3-08bb-4298-942f-dbb452619dae_686x590.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here are some tools you can use to quickly forecast on your data:</p><ul><li><p><strong><a href="https://facebook.github.io/prophet/">Prophet (by Meta)</a>:</strong> Prophet is a time series forecasting library designed for handling data with seasonality. It is particularly effective for forecasting business metrics, such as sales, website traffic, and resource usage. </p></li><li><p><strong><a href="https://www.databricks.com/blog/2020/01/27/time-series-forecasting-prophet-spark.html">Apache Spark MLlib</a> (+ Prophet):</strong> Spark MLlib is a scalable machine learning library that is integrated with Apache Spark. 
It provides various classification, regression, clustering, and collaborative filtering algorithms, making it suitable for building predictive models on large datasets.</p></li><li><p><strong><a href="https://docs.databricks.com/en/sql/language-manual/functions/ai_forecast.html">Databricks </a></strong><em><strong><a href="https://docs.databricks.com/en/sql/language-manual/functions/ai_forecast.html">ai_forecast()</a></strong></em><strong><a href="https://docs.databricks.com/en/sql/language-manual/functions/ai_forecast.html"> function</a>: </strong><code>ai_forecast()</code> is a table-valued function in Databricks designed to extrapolate time series data into the future. Literally just a SQL function&#8212;it can&#8217;t get any easier than this.</p></li></ul><div class="pullquote"><p><strong>Why: </strong>Forecasting supports proactive decision-making and helps you detect problems sooner.</p></div><h2>Retrieval Augmented Generation (RAG) on Your Data</h2><p>Your data is your company&#8217;s most valuable asset&#8212;period, full stop. Many companies, even outside the tech sector, have realized this value in recent years. Gen AI has expanded this value further with the Retrieval Augmented Generation (RAG) concept, in which relevant data is dynamically retrieved (typically from a vector database) and supplied to a large language model (LLM) so it can provide more contextual and correct answers.</p><p>As data engineers, we must start thinking about the most valuable data for this purpose. RAG isn&#8217;t limited to structured data&#8212;in fact, it is typically&nbsp;<em>more valuable</em>&nbsp;for unstructured data, which is traditionally more difficult to search or parse. 
For example, all of the following could be great candidates for creating &#8220;embeddings&#8221; to store in a vector database:</p><ul><li><p>PDF Documents (tip: check out <a href="https://github.com/StabRise/spark-pdf">this PDF reader</a> for Apache Spark!)</p></li><li><p>Images/Pictures</p></li><li><p>Audio Files</p></li><li><p>Logs and other Text Files</p></li><li><p>Excel/PowerPoint/Word Documents</p></li></ul><p>Of course, AI can also access your structured and semi-structured data, though for that you would typically leverage concepts like AI tools or function calling instead.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y6lr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304ca372-6f86-471f-ad3f-35f327d10171_818x578.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y6lr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304ca372-6f86-471f-ad3f-35f327d10171_818x578.png 424w, https://substackcdn.com/image/fetch/$s_!y6lr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304ca372-6f86-471f-ad3f-35f327d10171_818x578.png 848w, https://substackcdn.com/image/fetch/$s_!y6lr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304ca372-6f86-471f-ad3f-35f327d10171_818x578.png 1272w, https://substackcdn.com/image/fetch/$s_!y6lr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304ca372-6f86-471f-ad3f-35f327d10171_818x578.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!y6lr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304ca372-6f86-471f-ad3f-35f327d10171_818x578.png" width="818" height="578" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/304ca372-6f86-471f-ad3f-35f327d10171_818x578.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:578,&quot;width&quot;:818,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!y6lr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304ca372-6f86-471f-ad3f-35f327d10171_818x578.png 424w, https://substackcdn.com/image/fetch/$s_!y6lr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304ca372-6f86-471f-ad3f-35f327d10171_818x578.png 848w, https://substackcdn.com/image/fetch/$s_!y6lr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304ca372-6f86-471f-ad3f-35f327d10171_818x578.png 1272w, https://substackcdn.com/image/fetch/$s_!y6lr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304ca372-6f86-471f-ad3f-35f327d10171_818x578.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For RAG, here are some tools that can be used to build RAG pipelines, create and manage vector data storage, and integrate RAG with your applications:</p><ul><li><p><a href="https://haystack.deepset.ai/j">Haystack</a>: an open-source framework for building production-ready <em>LLM applications</em>, <em>RAG pipelines,</em> and <em>state-of-the-art search systems</em> that work intelligently over large document collections. 
Of course, there is also LangChain, which is worth checking out.</p></li><li><p><a href="https://docs.databricks.com/en/generative-ai/retrieval-augmented-generation.html">Databricks</a>: Databricks offers building blocks for RAG, such as its <a href="https://www.databricks.com/product/machine-learning/vector-search">Vector Search</a>, serverless model serving endpoints, tools or function calling, AI agent evaluation, and various LLMs.</p></li><li><p><a href="https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview">Azure AI Search</a>: a managed Azure cloud service that enables efficient searching of unstructured data, offering features like full-text search, autocomplete, and semantic ranking across various data formats and languages.</p></li><li><p><a href="https://aws.amazon.com/bedrock/">Amazon Bedrock</a>: a managed AWS cloud service supporting a variety of foundation models (FMs) and facilitating data vectorization and vector storage.</p></li></ul><div class="pullquote"><p><strong>Why: </strong>RAG opens the door to several new ways of leveraging your data alongside Large Language Models (LLMs), bringing your private data into context and improving accuracy, recency, and data lineage without requiring you to train your own model.</p></div><h2>AI for Data Quality</h2><p>Whether you&#8217;re already experienced in managing data quality or are just tackling issues as they come, AI offers clear opportunities for enhancement.</p><p>One of my favorite use cases here is using AI to automatically detect and obfuscate sensitive information. 
Databricks has a great solution accelerator for this use case, focused on Protected Health Information (PHI): <a href="https://www.databricks.com/solutions/accelerators/automated-phi-removal">https://www.databricks.com/solutions/accelerators/automated-phi-removal</a></p><p>I&#8217;m excited about using similar features from Databricks, such as&nbsp;<a href="https://docs.databricks.com/en/sql/language-manual/functions/ai_classify.html">ai_classify()</a>&nbsp;and&nbsp;<a href="https://docs.databricks.com/en/sql/language-manual/functions/ai_mask.html">ai_mask()</a>. These simple SQL functions use generative AI to classify input text according to the labels you provide or to mask entities such as PII, respectively.</p><p>Not to make it all about Databricks, but I have to give them kudos once more for <a href="https://www.databricks.com/product/machine-learning/lakehouse-monitoring">Lakehouse Monitoring</a>. Lakehouse Monitoring helps you monitor data integrity, changes in your data over time, statistical distributions, drift, and more. More importantly, as we build AI and use AI more ourselves, we need to ensure the quality of the <em>AI</em> itself, and Lakehouse Monitoring covers this by tracking model performance with metrics like F1 score.</p><div class="pullquote"><p><strong>Why: </strong>AI automates crucial and tedious tasks such as data monitoring, identification, tagging, and redaction. This identifies problems and secures your sensitive data more quickly and more reliably.</p></div><h2>AI for Business Intelligence - cut out the middleman!</h2><p>BI is usually a back-and-forth process in which data analysts and analytics engineers maintain SQL queries and data visualizations while handling a constant intake of new analytical questions from business stakeholders. 
Wouldn&#8217;t it be nice if you could just ask the questions and create those pretty bar charts on the fly?</p><p>This is one of the most interesting AI use cases for data practitioners, and I expect we&#8217;ll see many more in 2025. Major players, such as Databricks with its&nbsp;<a href="https://www.databricks.com/product/ai-bi/genie">AI/BI Genie</a>&nbsp;and Microsoft&#8217;s&nbsp;<a href="https://learn.microsoft.com/en-us/power-bi/create-reports/copilot-introduction">Copilot for Power BI, have already begun successfully demonstrating it</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c17d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d4928-12ef-4482-ab67-62a65743d631_758x638.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c17d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d4928-12ef-4482-ab67-62a65743d631_758x638.png 424w, https://substackcdn.com/image/fetch/$s_!c17d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d4928-12ef-4482-ab67-62a65743d631_758x638.png 848w, https://substackcdn.com/image/fetch/$s_!c17d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d4928-12ef-4482-ab67-62a65743d631_758x638.png 1272w, https://substackcdn.com/image/fetch/$s_!c17d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d4928-12ef-4482-ab67-62a65743d631_758x638.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!c17d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d4928-12ef-4482-ab67-62a65743d631_758x638.png" width="758" height="638" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e1d4928-12ef-4482-ab67-62a65743d631_758x638.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:638,&quot;width&quot;:758,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!c17d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d4928-12ef-4482-ab67-62a65743d631_758x638.png 424w, https://substackcdn.com/image/fetch/$s_!c17d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d4928-12ef-4482-ab67-62a65743d631_758x638.png 848w, https://substackcdn.com/image/fetch/$s_!c17d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d4928-12ef-4482-ab67-62a65743d631_758x638.png 1272w, https://substackcdn.com/image/fetch/$s_!c17d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e1d4928-12ef-4482-ab67-62a65743d631_758x638.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The key to this being successful will come down to the process in terms of &#8220;who&#8221; manages the tooling and &#8220;how&#8221; it is administered. 
Some tools, like Databricks Genie, require the setup of a &#8220;space&#8221; that is configured with data sources (tables), instructions, and example prompts/queries. These are crucial to get right: you&#8217;ll get out of it what you put into it.</p><div class="pullquote"><p><strong>Why: </strong>Business stakeholders get answers more quickly and with personalized visualizations, and BI engineers can focus on managing the underlying data, its quality, prompts, and feedback loops.</p></div><h2>Creating New Data Products with AI</h2><p>While AI has many uses for improving our productivity and internal workflows, I believe it also opens the door for us to generate new types of data products. The marketplace concept is already quite popular with cloud providers like AWS, Azure, GCP, and even leading data platforms like Databricks; these marketplaces have long hosted a myriad of datasets and applications, so an organization with data it would like to monetize already has a ready channel for doing so.</p><p>AI has expanded this market with the opportunity to create AI applications or turnkey components for AI integration. 
Here are a few examples I can think of, some of which are already being seen:</p><ul><li><p>Pre-computed Vector databases of common datasets</p></li><li><p>Fine-tuned small language models (SLMs) for targeted use cases</p></li><li><p>Data integrations for proprietary SaaS vendors: data retrievers, automated vector synchronization, function calling libraries.</p></li><li><p>Self-service for enterprise customers to access their own data via technology like <a href="https://delta.io/sharing/">Delta Sharing</a>, as more customers seek to utilize their data from vendors in their own AI.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AqUp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351724fc-082c-4c4b-bbe7-964dfcc60cc1_650x515.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AqUp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351724fc-082c-4c4b-bbe7-964dfcc60cc1_650x515.png 424w, https://substackcdn.com/image/fetch/$s_!AqUp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351724fc-082c-4c4b-bbe7-964dfcc60cc1_650x515.png 848w, https://substackcdn.com/image/fetch/$s_!AqUp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351724fc-082c-4c4b-bbe7-964dfcc60cc1_650x515.png 1272w, https://substackcdn.com/image/fetch/$s_!AqUp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351724fc-082c-4c4b-bbe7-964dfcc60cc1_650x515.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!AqUp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351724fc-082c-4c4b-bbe7-964dfcc60cc1_650x515.png" width="650" height="515" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/351724fc-082c-4c4b-bbe7-964dfcc60cc1_650x515.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:515,&quot;width&quot;:650,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AqUp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351724fc-082c-4c4b-bbe7-964dfcc60cc1_650x515.png 424w, https://substackcdn.com/image/fetch/$s_!AqUp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351724fc-082c-4c4b-bbe7-964dfcc60cc1_650x515.png 848w, https://substackcdn.com/image/fetch/$s_!AqUp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351724fc-082c-4c4b-bbe7-964dfcc60cc1_650x515.png 1272w, https://substackcdn.com/image/fetch/$s_!AqUp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351724fc-082c-4c4b-bbe7-964dfcc60cc1_650x515.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="pullquote"><p><strong>Why: </strong>Maximize the value of your data by capitalizing on its usefulness for AI and leveraging marketplaces to deliver and distribute it.</p></div><h2>Coding Assistants</h2><p>AI coding assistants were some of the very first AI tools to take off and show significant value to engineering teams. That also makes them some of the most mature, so if you&#8217;re not using one already, stop waiting.</p><p>Mileage may vary, but these are the ones I recommend the most:</p><ul><li><p><strong>Databricks Assistant: </strong>for all coding inside Databricks. 
Plus, it&#8217;s free!</p></li><li><p><strong>GitHub Copilot: </strong>Although I have read many negative comments comparing Copilot unfavorably to tools from more recent startups, it&#8217;s still one of the best all-purpose coding assistants I have used, regardless of language. It also supports settings like opting out of your completions being used for product improvements (training) and disabling suggestions <a href="https://docs.github.com/en/copilot/managing-copilot/managing-copilot-as-an-individual-subscriber/managing-copilot-policies-as-an-individual-subscriber#enabling-or-disabling-suggestions-matching-public-code">matching public code</a>.</p></li><li><p><strong>Amazon Q Developer: </strong>Perfect if your organization enforces tight vendor security and only allows AWS-native services.</p></li></ul><div class="pullquote"><p><strong>Why: </strong>Coding assistants help you produce code more quickly. No, they&#8217;re not replacing developers&#8212;they&#8217;re speeding up the more tedious aspects of coding so we humans can focus on the bigger tasks!</p></div><h2>Fixing Errors</h2><p>Interpreting error stack traces is a vital skill for engineers when troubleshooting issues. Still, even experienced engineers may need several minutes to parse through the error, search for solutions on sites such as Stack Overflow, and figure out how to apply the fix to their code.</p><p>AI can save you a ton of time by at least interpreting the error for you and often even fixing your code for you. For this reason, many coding assistants like the ones mentioned above have commands like <code>/fix</code>. 
</p><blockquote><p><strong>Pro-Tip: </strong>Databricks even has a Public Preview for its AI assistant diagnosing your failed jobs: <a href="https://docs.databricks.com/en/notebooks/use-databricks-assistant.html#diagnose-errors-in-jobs-public-preview">https://docs.databricks.com/en/notebooks/use-databricks-assistant.html#diagnose-errors-in-jobs-public-preview</a></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JMVp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc097e387-0c56-4b83-a0d9-9c5e435d0c5e_603x506.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JMVp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc097e387-0c56-4b83-a0d9-9c5e435d0c5e_603x506.png 424w, https://substackcdn.com/image/fetch/$s_!JMVp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc097e387-0c56-4b83-a0d9-9c5e435d0c5e_603x506.png 848w, https://substackcdn.com/image/fetch/$s_!JMVp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc097e387-0c56-4b83-a0d9-9c5e435d0c5e_603x506.png 1272w, https://substackcdn.com/image/fetch/$s_!JMVp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc097e387-0c56-4b83-a0d9-9c5e435d0c5e_603x506.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JMVp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc097e387-0c56-4b83-a0d9-9c5e435d0c5e_603x506.png" width="603" height="506" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c097e387-0c56-4b83-a0d9-9c5e435d0c5e_603x506.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:506,&quot;width&quot;:603,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!JMVp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc097e387-0c56-4b83-a0d9-9c5e435d0c5e_603x506.png 424w, https://substackcdn.com/image/fetch/$s_!JMVp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc097e387-0c56-4b83-a0d9-9c5e435d0c5e_603x506.png 848w, https://substackcdn.com/image/fetch/$s_!JMVp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc097e387-0c56-4b83-a0d9-9c5e435d0c5e_603x506.png 1272w, https://substackcdn.com/image/fetch/$s_!JMVp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc097e387-0c56-4b83-a0d9-9c5e435d0c5e_603x506.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I have heard complaints that the error explanations are sometimes wrong or that the suggested fixes frequently have bugs. You can often improve the accuracy of these outputs by enabling the setting that lets the tool (e.g., Amazon Q or GitHub Copilot) <strong>index</strong> your entire code workspace. This gives the assistant context about your complete codebase rather than just the file you have open.</p><div class="pullquote"><p><strong>Why: </strong>Researching and fixing exceptions is faster with the help of AI. 
Our time is better spent actually resolving the problem, not reading three Stack Overflow pages, skimming some two-year-old blog post, and scouring source code on GitHub.</p></div><h1>Conclusion</h1><p>I hope you have enjoyed these use cases and have some new ideas for integrating AI into your data engineering workflows and data products in 2025.</p><p>Remember, these tools and services are still maturing toward true enterprise readiness, so don&#8217;t settle for less by grabbing the first thing you read about! Try some of the tools I mentioned and decide for yourself. Let me know in the comments if you recommend any other AI tools!</p><p>Also, if you don&#8217;t mind supporting the&nbsp;<strong><a href="https://www.makewithdata.tech/">MakeWithData</a></strong>&nbsp;blog (I promise it&#8217;s really just me, an individual guy who loves to nerd out), please&nbsp;<em>like</em>&nbsp;this post and consider pledging for future posts below!</p><blockquote><p><em>MakeWithData is free today. But if you enjoyed this post, you can tell MakeWithData that their writing is valuable by pledging a future subscription. 
You won't be charged unless they enable payments.</em></p></blockquote><p></p>]]></content:encoded></item><item><title><![CDATA[Event Driven Design Patterns for Data Engineering]]></title><description><![CDATA[Explore event-driven designs that engineers can use for elegant real-time data processing and ETL.]]></description><link>https://www.makewithdata.tech/p/event-driven-design-patterns</link><guid isPermaLink="false">https://www.makewithdata.tech/p/event-driven-design-patterns</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Wed, 18 Dec 2024 13:30:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xA4c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd1250-23f1-4e65-b3b6-b07521b5179f_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xA4c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd1250-23f1-4e65-b3b6-b07521b5179f_1024x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xA4c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd1250-23f1-4e65-b3b6-b07521b5179f_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!xA4c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd1250-23f1-4e65-b3b6-b07521b5179f_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!xA4c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd1250-23f1-4e65-b3b6-b07521b5179f_1024x1024.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!xA4c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd1250-23f1-4e65-b3b6-b07521b5179f_1024x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xA4c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd1250-23f1-4e65-b3b6-b07521b5179f_1024x1024.webp" width="490" height="490" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/affd1250-23f1-4e65-b3b6-b07521b5179f_1024x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:490,&quot;bytes&quot;:286750,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xA4c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd1250-23f1-4e65-b3b6-b07521b5179f_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!xA4c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd1250-23f1-4e65-b3b6-b07521b5179f_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!xA4c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd1250-23f1-4e65-b3b6-b07521b5179f_1024x1024.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!xA4c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd1250-23f1-4e65-b3b6-b07521b5179f_1024x1024.webp 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>Introduction</h2><p>Event-driven architecture (EDA) is a software design paradigm that revolves around producing and consuming events to trigger processing. 
In this architecture, systems are built to respond to events in real time or near real time, allowing for more dynamic and responsive applications.</p><p>Data engineers often find comfort in the simplicity of cron schedules and workflow orchestration tools like Airflow, Kestra, and Databricks Jobs. However, EDA can also be a powerful pattern in the data engineer&#8217;s toolkit.</p><h3>Event-driven data engineering use cases</h3><p>We&#8217;ll focus on design patterns in a sec, but first let&#8217;s recap a few use cases where EDA might come in handy:</p><ol><li><p><strong>File Processing</strong>: process or ingest files that arrive on an unpredictable schedule.</p></li><li><p><strong>User Submissions: </strong>asynchronously process user submissions such as BI reports, security audits, or data deliveries.</p></li><li><p><strong>Cyber Security: </strong>monitor and react to Endpoint Detection and Response (EDR) events to keep devices and environments secure.</p></li><li><p><strong>IoT Data Processing: </strong>process and react to streams of data from IoT devices.</p></li><li><p><strong>Healthcare Monitoring</strong>: monitor patient data from various sources and provide alerts and insights to healthcare professionals for better patient care.</p></li></ol><h2>1. Change Data Capture (CDC)</h2><p>Change data capture, or CDC, is a method of capturing and processing all the inserts, updates, and deletes made to a table. 
If you&#8217;re using a data table format such as <a href="https://docs.delta.io/latest/delta-change-data-feed.html">Delta Lake</a> or <a href="https://iceberg.apache.org/docs/nightly/spark-procedures/#change-data-capture">Iceberg</a>, you can simply enable CDC on the table and query the CDC feed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JWpu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407617d7-eb98-4f63-aad4-4bf8adf160d4_1090x316.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JWpu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407617d7-eb98-4f63-aad4-4bf8adf160d4_1090x316.png 424w, https://substackcdn.com/image/fetch/$s_!JWpu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407617d7-eb98-4f63-aad4-4bf8adf160d4_1090x316.png 848w, https://substackcdn.com/image/fetch/$s_!JWpu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407617d7-eb98-4f63-aad4-4bf8adf160d4_1090x316.png 1272w, https://substackcdn.com/image/fetch/$s_!JWpu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407617d7-eb98-4f63-aad4-4bf8adf160d4_1090x316.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JWpu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407617d7-eb98-4f63-aad4-4bf8adf160d4_1090x316.png" width="1090" height="316" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/407617d7-eb98-4f63-aad4-4bf8adf160d4_1090x316.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:316,&quot;width&quot;:1090,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46380,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JWpu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407617d7-eb98-4f63-aad4-4bf8adf160d4_1090x316.png 424w, https://substackcdn.com/image/fetch/$s_!JWpu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407617d7-eb98-4f63-aad4-4bf8adf160d4_1090x316.png 848w, https://substackcdn.com/image/fetch/$s_!JWpu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407617d7-eb98-4f63-aad4-4bf8adf160d4_1090x316.png 1272w, https://substackcdn.com/image/fetch/$s_!JWpu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407617d7-eb98-4f63-aad4-4bf8adf160d4_1090x316.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Onj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef7dcd2a-4fe8-4eeb-860d-7487d7d58a9b_2370x1282.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Onj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef7dcd2a-4fe8-4eeb-860d-7487d7d58a9b_2370x1282.png 424w, https://substackcdn.com/image/fetch/$s_!4Onj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef7dcd2a-4fe8-4eeb-860d-7487d7d58a9b_2370x1282.png 848w, 
https://substackcdn.com/image/fetch/$s_!4Onj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef7dcd2a-4fe8-4eeb-860d-7487d7d58a9b_2370x1282.png 1272w, https://substackcdn.com/image/fetch/$s_!4Onj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef7dcd2a-4fe8-4eeb-860d-7487d7d58a9b_2370x1282.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Onj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef7dcd2a-4fe8-4eeb-860d-7487d7d58a9b_2370x1282.png" width="1456" height="788" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef7dcd2a-4fe8-4eeb-860d-7487d7d58a9b_2370x1282.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:788,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:308089,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4Onj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef7dcd2a-4fe8-4eeb-860d-7487d7d58a9b_2370x1282.png 424w, https://substackcdn.com/image/fetch/$s_!4Onj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef7dcd2a-4fe8-4eeb-860d-7487d7d58a9b_2370x1282.png 848w, 
https://substackcdn.com/image/fetch/$s_!4Onj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef7dcd2a-4fe8-4eeb-860d-7487d7d58a9b_2370x1282.png 1272w, https://substackcdn.com/image/fetch/$s_!4Onj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef7dcd2a-4fe8-4eeb-860d-7487d7d58a9b_2370x1282.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>CDC can be useful for several of the aforementioned use cases, with the exception of file processing, since the table itself can 
store any structured or semi-structured data.</p><p></p><h2>2. Trigger on File Arrival</h2><p>I am a huge fan of Databricks and have to make a plug for its File Arrival Triggers, which I have an entire separate blog post on here:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;80396e00-f2d1-44cc-b34d-13f6418b96d9&quot;,&quot;caption&quot;:&quot;Databricks released the File Arrival Trigger feature for its Jobs/Workflows this year. Essentially, you point the job at a cloud storage location, such as S3, ADLS, or GCS, which triggers whenever new files arrive.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Trigger Databricks Jobs on File Arrival&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:190683435,&quot;name&quot;:&quot;Zach King&quot;,&quot;bio&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63a71c63-8f9b-44a9-a986-49bd85cdf4ea_1024x1024.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-10-28T22:15:55.276Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e6df062-8f99-466b-8cfc-f595ed1edea2_1024x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.makewithdata.tech/p/trigger-databricks-jobs-on-file-arrival&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:150857816,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;MakeWithData&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70751bc-4
02f-4ebe-bd35-7cd5e8239d0c_793x793.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p></p><p>However, if you aren&#8217;t on Databricks, have no fear. You can also build this pattern with common cloud services. For example, in AWS you could use <strong>S3 + S3 Event Notifications + Lambda</strong>. Additionally, some orchestration tools like <a href="https://kestra.io/blueprints/s3-trigger-python">Kestra</a> support S3 triggers natively.</p><p>Obviously this can be used to process any arbitrary file data as it lands in storage. Another great use case for this pattern is to take user submissions and generate a manifest file in S3 that describes the request, then use this pattern to trigger a data process against the submission. Some ideas for user submissions could be an intensive data analytics report, an ad-hoc security audit, or a data resynchronization process for runbooks.</p><blockquote><p><strong>Note</strong>: if you receive a large number of files very frequently, a solution using <a href="https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/index.html">Databricks Auto Loader</a> would scale better.</p></blockquote><p></p><h2>3. SNS + SQS + Lambda</h2><p>Another popular AWS recipe is Simple Notification Service (SNS), Simple Queue Service (SQS), and Lambda functions. 
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O7N4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff885106f-9347-4aef-9508-122415b6587a_780x228.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O7N4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff885106f-9347-4aef-9508-122415b6587a_780x228.png 424w, https://substackcdn.com/image/fetch/$s_!O7N4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff885106f-9347-4aef-9508-122415b6587a_780x228.png 848w, https://substackcdn.com/image/fetch/$s_!O7N4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff885106f-9347-4aef-9508-122415b6587a_780x228.png 1272w, https://substackcdn.com/image/fetch/$s_!O7N4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff885106f-9347-4aef-9508-122415b6587a_780x228.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O7N4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff885106f-9347-4aef-9508-122415b6587a_780x228.png" width="780" height="228" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f885106f-9347-4aef-9508-122415b6587a_780x228.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:228,&quot;width&quot;:780,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22465,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!O7N4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff885106f-9347-4aef-9508-122415b6587a_780x228.png 424w, https://substackcdn.com/image/fetch/$s_!O7N4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff885106f-9347-4aef-9508-122415b6587a_780x228.png 848w, https://substackcdn.com/image/fetch/$s_!O7N4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff885106f-9347-4aef-9508-122415b6587a_780x228.png 1272w, https://substackcdn.com/image/fetch/$s_!O7N4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff885106f-9347-4aef-9508-122415b6587a_780x228.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>At this point nearly all AWS services have some integrations with SNS, as do many 3rd party applications. 
</p><p>By integrating your SNS topic with an SQS queue, you make the processing more scalable and durable: SQS buffers messages until a consumer is ready, supports retries (re-drive) and batching, and optionally lets you set up a Dead Letter Queue (DLQ) for messages that repeatedly fail processing.</p><p>This design pattern is useful for so many use cases because SNS is so easy to integrate. Many monitoring tools like CloudWatch, Datadog, New Relic, Grafana, and others support SNS topics as a notification destination, so you could create some alert processors for real-time operations. You could also use this for user activity and IoT events, ingesting them into a lakehouse table or OLTP database.</p><p></p><h2>4. Streaming Services</h2><p>Moving into more continuous data feeds? High volume, low latency? This is when scalable streaming services such as Kafka, Kinesis, or Apache Pulsar are a better choice.</p><p>Streaming tools like these excel at high-volume, continuous data that needs immediate processing. They also typically store data in a persistent manner, allowing you to replay events if needed.</p><p>Many technologies make it easy to read from or write to streams like Kinesis and Kafka; you could:</p><ul><li><p>Trigger an <a href="https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html">AWS Lambda</a> function directly</p></li><li><p>Read/write with <a href="https://docs.databricks.com/en/connect/streaming/kinesis.html">Apache Spark</a></p></li><li><p>Read/write with <a href="https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/kinesis/">Apache Flink</a></p></li></ul><p>This would be appropriate for more demanding use cases such as high-volume streams of IoT data, auditing API usage or user activity, real-time anomaly detection, processing financial transactions, and more.</p><p></p><h2>Cautions&#8230;</h2><p>As with any design pattern, we should be mindful of its limitations. 
In EDA solutions, these limitations may not always apply, but some things to be aware of are:</p><ul><li><p><strong>Event Ordering: </strong>does sequence matter to your system? Does your EDA design enforce order? E.g. if using SQS, you may need a FIFO queue, and with Kinesis pay attention to each record&#8217;s <code>sequenceNumber</code>.</p></li><li><p><strong>Duplicates:</strong> can events be duplicated, and is this OK? Tip: at times this can actually be a feature, akin to replaying data.</p></li><li><p><strong>Observability: </strong>EDA systems usually have higher complexity, so metrics and monitoring are important. E.g. are you able to detect when events fail to deliver or are delayed? If processing falls behind, can you monitor the backlog waiting to be processed?</p></li><li><p><strong>Complexity: </strong>again, EDA can be more complex to monitor and troubleshoot. Consider whether your use case even demands EDA. E.g. if you are using it for file arrival triggers, do your files arrive consistently enough that a simple schedule would do? Remember, if all you have is a hammer&#8230; <em>everything</em> looks like a nail.</p></li></ul><h1>Conclusion</h1><p>That concludes this rundown of a few EDA design patterns for data engineering. These patterns can make data ingestion and processing more real-time, more responsive, and potentially more cost-effective.</p><p>I hope you have enjoyed this read. I would seriously love to hear your feedback and your own ideas for design patterns!</p><p>Also, if you don&#8217;t mind supporting the <strong><a href="https://www.makewithdata.tech">MakeWithData</a></strong> blog (I promise it&#8217;s really just me, an individual guy who just loves to nerd out), please <em>like</em> this post and even consider pledging for future posts below!</p><blockquote><p><em>MakeWithData is free today. But if you enjoyed this post, you can tell MakeWithData that their writing is valuable by pledging a future subscription. 
You won't be charged unless they enable payments.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.makewithdata.tech/subscribe?&quot;,&quot;text&quot;:&quot;Pledge your support&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.makewithdata.tech/subscribe?"><span>Pledge your support</span></a></p></blockquote><p></p>]]></content:encoded></item><item><title><![CDATA[Spark 4.0 Intro to Custom Data Sources with SQS]]></title><description><![CDATA[Let's explore PySpark's new custom data source feature by making our own source for AWS SQS.]]></description><link>https://www.makewithdata.tech/p/spark-40-intro-to-custom-data-sources</link><guid isPermaLink="false">https://www.makewithdata.tech/p/spark-40-intro-to-custom-data-sources</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Tue, 17 Dec 2024 13:31:21 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/89179cdb-bb03-4908-99ca-a384cedfc779_1024x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><p>Apache Spark v4.0 is coming soon and is jam-packed with awesome features, including the much-awaited <a href="https://spark.apache.org/docs/preview/api/python/user_guide/sql/python_data_source.html">custom data sources API</a> for PySpark (also available in <a href="https://docs.databricks.com/en/pyspark/datasources.html">Databricks Runtime 15.3+</a>). Previously, this feature was only available to Scala programmers.</p><p>This enables data engineers to create integrations for reading and writing data, in batch or with streaming, with idiomatic Spark code. 
Let&#8217;s take a tour of this new feature by building our own custom data source for AWS Simple Queue Service (SQS).</p><h2>Implementing the DataSource class</h2><p>The first thing we need to create is the <code>DataSource</code> class, the top-level class that represents our custom data source.</p><pre><code>from pyspark.sql.datasource import (
    DataSource,
    DataSourceReader,
    InputPartition,
)
from pyspark.sql.types import StructType

from datetime import datetime
from typing import Iterator, Tuple


class SQSDataSource(DataSource):
    """
    PySpark data source for batch querying data from a
    SQS queue.
    """

    @classmethod
    def name(cls):
        return "spark_sqs"

    def schema(self):
        return """
            message_id STRING, 
            receipt_handle STRING, 
            md5_of_body STRING, 
            body STRING, 
            sent_timestamp TIMESTAMP
        """

    def reader(self, schema: StructType):
        return SQSDataSourceReader(schema, self.options)
</code></pre><p>The <code>name</code> method tells Spark the name of our custom source (<code>spark_sqs</code>), which is what we&#8217;ll later use in <code>spark.read.format("spark_sqs")</code>. The <code>schema</code> method returns the data source schema as a DDL string, and the <code>reader</code> method returns our batch reader, passing along the options supplied by the caller.</p><blockquote><p>Note: the DataSource API also supports streaming readers, as well as writers for building your own sinks. In this part we will look at the simple batch reader only.</p></blockquote><h2>Implementing the DataSourceReader</h2><p>Next, we must implement the simple data source reader.</p><pre><code>class SQSDataSourceReader(DataSourceReader):
    def __init__(self, schema: StructType, options: dict):
        print('Initializing SQS Data Source Reader')
        self.schema: StructType = schema
        self.options: dict = options
        if not options.get('queue_url'):
            raise ValueError('queue_url is required')
        self.queue_url = options.get('queue_url')
        self.region_name = options.get('region_name', 'us-east-1')
        self.max_messages = int(options.get('max_messages', '10'))
        self.visibility_timeout = int(options.get('visibility_timeout', '20'))
        self.wait_time_seconds = int(options.get('wait_time_seconds', '20'))
        self.delete_message = options.get('delete_message', 'false').lower() == 'true'


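    def read(self, partition: InputPartition) -&gt; Iterator[Tuple]:
        # Import boto3 and build the client inside read(): this method runs on the
        # executors, and the reader instance is pickled to get there, so it should
        # only hold plain, picklable values.
        import boto3
        sqs = boto3.client('sqs', region_name=self.region_name)
        # A single receive call per read; SQS returns at most 10 messages per request.
        response = sqs.receive_message(
            QueueUrl=self.queue_url,
            MaxNumberOfMessages=self.max_messages,
            WaitTimeSeconds=self.wait_time_seconds,
            VisibilityTimeout=self.visibility_timeout,
            MessageSystemAttributeNames=['SentTimestamp']
        )
        # 'Messages' is absent when the queue returned nothing, hence the default.
        for message in response.get('Messages', []):
            yield (
                message['MessageId'],
                message['ReceiptHandle'],
                message['MD5OfBody'],
                message['Body'],
                # SentTimestamp is epoch milliseconds, returned as a string
                datetime.fromtimestamp(int(message['Attributes']['SentTimestamp']) / 1000)
            )
            # Code after the yield resumes when Spark pulls the next row, so a
            # message is deleted only after it has been handed to Spark.
            if self.delete_message:
                sqs.delete_message(
                    QueueUrl=self.queue_url,
                    ReceiptHandle=message['ReceiptHandle']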
                )</code></pre><p>Most of the work happens in the <code>read</code> method, which receives a batch of messages from the queue and yields each one as a row; the rest of the code is boilerplate that lets us configure the SQS poll settings through reader options.</p><h2>Using the custom data source</h2><p>To use our custom data source, we must first register it to make Spark aware of the implementation.</p><pre><code>spark.dataSource.register(SQSDataSource)</code></pre><p>Now we can read from a queue with idiomatic Spark code:</p><pre><code>spark.read.format("spark_sqs").options(
        queue_url='https://sqs.us-east-1.amazonaws.com/1234567890/makewithdata-events',
        region_name='us-east-1',
    ).load().show()</code></pre><pre><code>+--------------------+--------------------+--------------------+--------------------+--------------------+
|          message_id|      receipt_handle|         md5_of_body|                body|      sent_timestamp|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|7322a8a2-180f-4a3...|AQEBb9rBz2DJ8RAWA...|c89b42169df3bbb3e...|{\n    "_id": "67...|2024-12-16 20:13:...|
|d6a1d9b8-c15f-44c...|AQEBVnrFbfP/qrBpg...|daa063a083235cf91...|{\n    "_id": "67...|2024-12-16 20:13:...|
+--------------------+--------------------+--------------------+--------------------+--------------------+</code></pre><h2>Next Steps</h2><p>This is merely a quick way to play around with custom sources, but our SQS source could be a lot more useful if we added more features:</p><ol><li><p>Implement a streaming reader that deletes messages as the reader writes checkpoints.</p></li><li><p>Poll until the queue is empty.</p></li><li><p>Implement a sink so we can write data to SQS as well.</p></li><li><p>Extract all available attribute data from the messages.</p></li></ol><p>If you&#8217;d like to see these next parts implemented, leave me a comment below and subscribe to the MakeWithData blog!</p>]]></content:encoded></item><item><title><![CDATA[Databricks Workflow Mistakes You’re Probably Making (And How to Fix Them)]]></title><description><![CDATA[Today, we will look at many of the most common pitfalls I've seen (and learned the hard way myself a time or two) and how to fix them using best practices and rich features from the famous Data Intelligence platform!]]></description><link>https://www.makewithdata.tech/p/databricks-workflow-mistakes</link><guid isPermaLink="false">https://www.makewithdata.tech/p/databricks-workflow-mistakes</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Mon, 18 Nov 2024 13:02:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!PGMt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9bbf7ae-3fad-4ecc-bd79-04aa0358713f_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PGMt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9bbf7ae-3fad-4ecc-bd79-04aa0358713f_1024x1024.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PGMt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9bbf7ae-3fad-4ecc-bd79-04aa0358713f_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!PGMt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9bbf7ae-3fad-4ecc-bd79-04aa0358713f_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!PGMt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9bbf7ae-3fad-4ecc-bd79-04aa0358713f_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!PGMt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9bbf7ae-3fad-4ecc-bd79-04aa0358713f_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PGMt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9bbf7ae-3fad-4ecc-bd79-04aa0358713f_1024x1024.png" width="476" height="476" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c9bbf7ae-3fad-4ecc-bd79-04aa0358713f_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:476,&quot;bytes&quot;:249660,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!PGMt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9bbf7ae-3fad-4ecc-bd79-04aa0358713f_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!PGMt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9bbf7ae-3fad-4ecc-bd79-04aa0358713f_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!PGMt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9bbf7ae-3fad-4ecc-bd79-04aa0358713f_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!PGMt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9bbf7ae-3fad-4ecc-bd79-04aa0358713f_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>Today, we will look at many of the most common pitfalls I've seen (and learned the hard way myself a time or two) and how to fix them using best practices and rich features from the famous Data Intelligence platform!</p><h2>1. Relying on Schedules to "orchestrate" Tasks</h2><p>If you're like me and started using Databricks several years ago, you know that we could not always run multiple tasks in a job, much less orchestrate them. Often, data engineers need to break down a job into smaller steps or tasks, whether to separate runtime dependencies, create sequential and parallel flows, or modularize the code. A common mistake here is to build separate jobs and do a bit of cron schedule surgery to align these jobs so they run in a well-timed chain.</p><p>For example, we may build a data pipeline for a typical medallion architecture like so:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RXTy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e032e4e-0d1f-420c-8e4d-84b651f34534_4498x1058.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RXTy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e032e4e-0d1f-420c-8e4d-84b651f34534_4498x1058.png 424w, 
https://substackcdn.com/image/fetch/$s_!RXTy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e032e4e-0d1f-420c-8e4d-84b651f34534_4498x1058.png 848w, https://substackcdn.com/image/fetch/$s_!RXTy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e032e4e-0d1f-420c-8e4d-84b651f34534_4498x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!RXTy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e032e4e-0d1f-420c-8e4d-84b651f34534_4498x1058.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RXTy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e032e4e-0d1f-420c-8e4d-84b651f34534_4498x1058.png" width="1456" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e032e4e-0d1f-420c-8e4d-84b651f34534_4498x1058.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:319372,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RXTy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e032e4e-0d1f-420c-8e4d-84b651f34534_4498x1058.png 424w, 
https://substackcdn.com/image/fetch/$s_!RXTy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e032e4e-0d1f-420c-8e4d-84b651f34534_4498x1058.png 848w, https://substackcdn.com/image/fetch/$s_!RXTy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e032e4e-0d1f-420c-8e4d-84b651f34534_4498x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!RXTy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e032e4e-0d1f-420c-8e4d-84b651f34534_4498x1058.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><p>This setup is brittle: if any job runs long, it "misses the train" for the next step. Likewise, if any job fails, the downstream jobs have no way of knowing they should skip, so they run anyway, adding cloud costs and potentially triggering more alerts. Some teams worked around this by running their own Apache Airflow servers to orchestrate the jobs. Fortunately, Databricks saw the need for robust orchestration and now has several features built in: multiple tasks in one job, task dependencies (run some tasks in parallel and others in sequence), and conditional task runs (e.g., "run this task if XYZ happened").</p><p>Using these features, we should only have one job for this use case. Each step is a <strong>Task</strong> configured to depend on the completion of the previous task. You can even refresh a SQL dashboard as one of your job tasks. However, at the time of this writing, Databricks only supports legacy SQL dashboards in this feature. 
Hopefully, it will add Lakeview Dashboards soon!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DAM0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad91699-428b-4cf1-bfab-a36016dea37b_5670x1526.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DAM0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad91699-428b-4cf1-bfab-a36016dea37b_5670x1526.png 424w, https://substackcdn.com/image/fetch/$s_!DAM0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad91699-428b-4cf1-bfab-a36016dea37b_5670x1526.png 848w, https://substackcdn.com/image/fetch/$s_!DAM0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad91699-428b-4cf1-bfab-a36016dea37b_5670x1526.png 1272w, https://substackcdn.com/image/fetch/$s_!DAM0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad91699-428b-4cf1-bfab-a36016dea37b_5670x1526.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DAM0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad91699-428b-4cf1-bfab-a36016dea37b_5670x1526.png" width="1456" height="392" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ad91699-428b-4cf1-bfab-a36016dea37b_5670x1526.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:392,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:487393,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DAM0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad91699-428b-4cf1-bfab-a36016dea37b_5670x1526.png 424w, https://substackcdn.com/image/fetch/$s_!DAM0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad91699-428b-4cf1-bfab-a36016dea37b_5670x1526.png 848w, https://substackcdn.com/image/fetch/$s_!DAM0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad91699-428b-4cf1-bfab-a36016dea37b_5670x1526.png 1272w, https://substackcdn.com/image/fetch/$s_!DAM0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad91699-428b-4cf1-bfab-a36016dea37b_5670x1526.png 1456w" sizes="100vw"></picture></div></a></figure></div>
<h2>2. No Retries</h2><p>Another common mistake is forgetting to configure <code>max_retries</code> on your Databricks job or its tasks. 
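</p><p>As a minimal sketch of what that configuration can look like, the Python snippet below adds retry settings to a task definition shaped like a Jobs API payload. The <code>with_retries</code> helper, the job name, the task key, and the notebook path are all hypothetical, and the values are placeholders rather than recommendations; <code>min_retry_interval_millis</code> and <code>retry_on_timeout</code> are the task-level companions to <code>max_retries</code>.</p>

```python
import json


def with_retries(task, max_retries=3, min_retry_interval_millis=60000, retry_on_timeout=False):
    """Return a copy of a task definition with retry settings applied."""
    configured = dict(task)  # shallow copy so the original task is left untouched
    configured["max_retries"] = max_retries
    configured["min_retry_interval_millis"] = min_retry_interval_millis
    configured["retry_on_timeout"] = retry_on_timeout
    return configured


# Hypothetical task definition: the task key and notebook path are made up.
bronze_task = {
    "task_key": "ingest_bronze",
    "notebook_task": {"notebook_path": "/Workspace/pipelines/ingest_bronze"},
}

# Hypothetical job settings that could be sent to the Jobs API or kept in source control.
job_settings = {
    "name": "medallion-pipeline",
    "tasks": [with_retries(bronze_task)],
}

print(json.dumps(job_settings, indent=2))
```

<p>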
This is one of the easiest ways to add fault tolerance to your jobs and prevent transient issues from derailing critical workflows.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9DfK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934b586-a74d-4a6d-bbd1-8b33c34df3a9_2864x1774.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9DfK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934b586-a74d-4a6d-bbd1-8b33c34df3a9_2864x1774.png 424w, https://substackcdn.com/image/fetch/$s_!9DfK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934b586-a74d-4a6d-bbd1-8b33c34df3a9_2864x1774.png 848w, https://substackcdn.com/image/fetch/$s_!9DfK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934b586-a74d-4a6d-bbd1-8b33c34df3a9_2864x1774.png 1272w, https://substackcdn.com/image/fetch/$s_!9DfK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934b586-a74d-4a6d-bbd1-8b33c34df3a9_2864x1774.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9DfK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934b586-a74d-4a6d-bbd1-8b33c34df3a9_2864x1774.png" width="1456" height="902" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3934b586-a74d-4a6d-bbd1-8b33c34df3a9_2864x1774.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:902,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;0003-databricks-retries.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="0003-databricks-retries.png" title="0003-databricks-retries.png" srcset="https://substackcdn.com/image/fetch/$s_!9DfK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934b586-a74d-4a6d-bbd1-8b33c34df3a9_2864x1774.png 424w, https://substackcdn.com/image/fetch/$s_!9DfK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934b586-a74d-4a6d-bbd1-8b33c34df3a9_2864x1774.png 848w, https://substackcdn.com/image/fetch/$s_!9DfK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934b586-a74d-4a6d-bbd1-8b33c34df3a9_2864x1774.png 1272w, https://substackcdn.com/image/fetch/$s_!9DfK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3934b586-a74d-4a6d-bbd1-8b33c34df3a9_2864x1774.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h2>3. Let it Fail! Don't Swallow Exceptions</h2><p>Retries are great, but they only kick in when the task actually "fails." It is easy to defeat them by accident if your code catches exceptions, sends them to a monitoring tool like Prometheus, CloudWatch, or Datadog, and then carries on.</p><pre><code><code>import boto3

# region_name is assumed to be defined earlier in the notebook or job
cw = boto3.client('cloudwatch', region_name=region_name)

try:
&#9;df = spark.read.table("...")
&#9;# application code...
except Exception as e:
&#9;# Log error to CloudWatch
&#9;cw.put_metric_data(...)

&#9;# BAD!! Exception gets swallowed, and the job run is considered "successful."
</code></code></pre><p>If you want to instrument error metrics this way, remember to re-raise the exception so that Databricks flags the task as "failed." That failure is what triggers the retries, sends notifications, and enables other features like conditional runs.</p><pre><code><code>except Exception as e:
&#9;# Log error to CloudWatch
&#9;cw.put_metric_data(...)

&#9;# Re-raise to trigger job failure (a bare raise re-raises the active exception)
&#9;raise
</code></code></pre><h2>4. Not Using Notifications</h2><p>One more easy win in Databricks Workflows is notifications. As of this writing, Databricks supports notifications via email, Slack, Microsoft Teams, PagerDuty, and generic webhooks.</p><p>Built-in notifications can often replace convoluted monitoring solutions that require maintaining custom integrations or instrumentation in your code.</p><h2>5. Git Folder vs. Git Integration for Notebooks</h2><p>This may be a personal preference, but I feel strongly about it. If you are using "Git Folders" (formerly known as "Git Repos") in the Databricks workspace as the source for your notebook jobs, you should strongly consider switching to <a href="https://docs.databricks.com/en/jobs/configure-job.html#use-git-with-jobs">Git Integration</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yl-K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94b4826-6c2f-413d-b168-2ec66039d992_1278x786.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yl-K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94b4826-6c2f-413d-b168-2ec66039d992_1278x786.png 424w, https://substackcdn.com/image/fetch/$s_!Yl-K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94b4826-6c2f-413d-b168-2ec66039d992_1278x786.png 848w, https://substackcdn.com/image/fetch/$s_!Yl-K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94b4826-6c2f-413d-b168-2ec66039d992_1278x786.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Yl-K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94b4826-6c2f-413d-b168-2ec66039d992_1278x786.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yl-K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94b4826-6c2f-413d-b168-2ec66039d992_1278x786.png" width="1278" height="786" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f94b4826-6c2f-413d-b168-2ec66039d992_1278x786.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:786,&quot;width&quot;:1278,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;0004-git-integration.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="0004-git-integration.png" title="0004-git-integration.png" srcset="https://substackcdn.com/image/fetch/$s_!Yl-K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94b4826-6c2f-413d-b168-2ec66039d992_1278x786.png 424w, https://substackcdn.com/image/fetch/$s_!Yl-K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94b4826-6c2f-413d-b168-2ec66039d992_1278x786.png 848w, https://substackcdn.com/image/fetch/$s_!Yl-K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94b4826-6c2f-413d-b168-2ec66039d992_1278x786.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Yl-K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94b4826-6c2f-413d-b168-2ec66039d992_1278x786.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>What's the difference? Git folders are typically created by users and kept in their user directory on the workspace. 
More importantly, Git folders must be pulled explicitly to receive changes; if you go this route, you should use the Databricks SDK from your CI/CD pipeline (e.g., GitHub Actions) to automate checking out new tags/branches or pulling changes from a main branch.</p><p>Jobs have built-in Git integration that removes much of that burden. As shown in the picture above, this feature allows your job to effectively check out the specified repo and Git ref when a run starts. If you point it at a main branch like 'main' or 'master,' you can achieve a minimal "CD" pipeline with zero CI/CD tooling!</p><h2>6. Someone Leaves the Company, Jobs Fail?</h2><p>If this one has burned you, you have my sympathy; here is how to avoid it in the future. Databricks makes it very easy to convert a prototype notebook into an automated/scheduled job, almost too easy. When users create jobs via the UI, they often leave the default "Run As" set to themselves and think nothing of it. The job runs fine for months or years; then, one day, that user leaves the company. The next day, your team is greeted by failure alerts and emails about missing data!</p><p>This is very common, since organizations routinely deactivate accounts and revoke access as part of employee offboarding. Our Databricks jobs may fail when they run as a user and rely on that user's access to Unity Catalog or notebooks in their workspace directory (although default settings in Databricks do not delete workspace files after the user is deactivated).</p><p>The solution is not unique to Databricks, since this is a common problem on any platform as a service: <a href="https://docs.databricks.com/en/admin/users-groups/service-principals.html">Service Principals</a>.</p><p>Honestly, this is one of the more tedious resources to manage in Databricks. 
As Databricks account administrators, we can easily create new service principals, assign them access to workspaces, and create OAuth credentials for Machine-to-Machine (M2M) authentication in our applications. However, to combine this with the lesson learned in #5, we must authenticate as the Service Principal with its OAuth credentials and call the Databricks REST API <a href="https://docs.databricks.com/api/workspace/gitcredentials/create">POST /api/2.0/git-credentials</a>. Because the call is made as the Service Principal, the Git credentials it creates belong to the Service Principal.</p><blockquote><p><strong>Note:</strong> Databricks, if you are reading this, please consider adding the ability to manage Git credentials for Service Principals. You can upvote my idea on the Databricks Ideas Portal <a href="https://ideas.databricks.com/ideas/DBE-I-1549">here</a>!</p></blockquote><p>Once you get it set up (hopefully not often), Service Principals make your jobs much more resilient to changes in org structure or employees leaving the company. They also enhance the overall security of your Unity Catalog by not over-exposing write access to individual users: your coworker, <em>Bob</em>, is not the one writing to that gold table, it's the <em>daily-customer-report</em> <strong>Job</strong> that writes to it &#128521;.</p><h2>Conclusion</h2><p>Let's wrap up. Databricks Workflows are awesome (one of the platform's most mature offerings, in my opinion), and with these tips we can maximize their resiliency and usability.</p><p>I implore you to consider these lessons the next time you create a new Databricks Workflow. If you have tips and tricks for Databricks yourself, please let me know in the comments!</p><p>Follow for more content like this!</p>]]></content:encoded></item><item><title><![CDATA[Trigger Databricks Jobs on File Arrival]]></title><description><![CDATA[Use this new feature for event-driven Databricks Jobs to trigger when files arrive in your cloud storage. 
We'll look at 4 example use cases of when you should use file arrival triggers to enhance your workflows.]]></description><link>https://www.makewithdata.tech/p/trigger-databricks-jobs-on-file-arrival</link><guid isPermaLink="false">https://www.makewithdata.tech/p/trigger-databricks-jobs-on-file-arrival</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Mon, 28 Oct 2024 22:15:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e6df062-8f99-466b-8cfc-f595ed1edea2_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kLNZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e6df062-8f99-466b-8cfc-f595ed1edea2_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kLNZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e6df062-8f99-466b-8cfc-f595ed1edea2_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!kLNZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e6df062-8f99-466b-8cfc-f595ed1edea2_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!kLNZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e6df062-8f99-466b-8cfc-f595ed1edea2_1024x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!kLNZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e6df062-8f99-466b-8cfc-f595ed1edea2_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kLNZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e6df062-8f99-466b-8cfc-f595ed1edea2_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e6df062-8f99-466b-8cfc-f595ed1edea2_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1600447,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kLNZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e6df062-8f99-466b-8cfc-f595ed1edea2_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!kLNZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e6df062-8f99-466b-8cfc-f595ed1edea2_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!kLNZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e6df062-8f99-466b-8cfc-f595ed1edea2_1024x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!kLNZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e6df062-8f99-466b-8cfc-f595ed1edea2_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p></p><p>Databricks released the <a href="https://docs.databricks.com/en/jobs/file-arrival-triggers.html">File Arrival Trigger</a> feature for its Jobs/Workflows this year. Essentially, you point the job at a cloud storage location, such as S3, ADLS, or GCS, and the job runs whenever new files arrive there.</p><p>This is very useful for creating event-driven jobs. 
Previously, this required gluing together a lot of cloud-specific tooling: for example, an S3 bucket with S3 event notifications feeding a Lambda function that finally calls the Databricks SDK, API, or CLI to run the job. So, I see this new feature as a great and very practical simplification.</p><p>Let's tour a few use cases where you may want to use file arrival triggers.</p><h2>1. Irregular Schedule</h2><p>The first use case is even called out in the aforementioned Databricks docs:</p><blockquote><p>You can use this feature when a scheduled job might be inefficient because new data arrives on an irregular schedule.</p></blockquote><p>Say you receive files on S3 from a data vendor that collects and provides cross-reference data for your industry. The vendor does not deliver new files on a regular cron schedule. There is usually one file per day, arriving anywhere from 1 p.m. to 6 p.m. UTC, but because the vendor pushes the files manually, some days have no file at all, typically holidays or weekends when someone forgets to push it or backup staff are covering.</p><ul><li><p><strong>Naive Solution:</strong> Schedule a job to run hourly every day from 1 p.m. to 6 p.m. (six runs per day). Write code in the job to look for new files and exit if none are found.</p><ul><li><p>With only one file expected per day, at least five of those runs are wasted every day, and more on holidays and weekends when no file arrives.</p></li></ul></li><li><p><strong>File Arrival Solution:</strong> Use file arrival triggers. There are no wasted runs, and the code focuses on the business logic.</p></li></ul><h2>2. Batch Submissions</h2><p>Now say we work for a FinTech company where the data team works closely with the front-end team. The front-end team wants users to be able to request an export of all their transactions for a given date range and receive it by email as file attachments.</p><p>We can build a solution using file arrival triggers and a clever manifest file. 
Here is the design.</p><p>When a user fills out the submission form on the web app, the front end generates a JSON manifest file that encapsulates their input:</p><pre><code><code>{
&#9;"userId": 123,
&#9;"emailRecipients": [
&#9;&#9;"john.doe@example.com"
&#9;],
&#9;"requestType": "EMAIL_REPORT",
&#9;"parameters": {
&#9;&#9;"dateRange": {
&#9;&#9;&#9;"start": "2024-01-01",
&#9;&#9;&#9;"end": "2024-09-01"
&#9;&#9;}
&#9;}
}</code></code></pre><p>The front end drops this file into an S3 bucket, and its arrival triggers a Databricks Job. Our code processes the request and emails the user. This solution can easily be extended with additional parameters and request types as the application evolves.</p><h2>3. User Content Moderation</h2><p>Similar to #1, data arrives irregularly when end users generate it. Say we work at a media hosting company where users upload images and videos sporadically, and we are required to moderate and tag the content for anything out of compliance.</p><p>File arrival triggers allow you to create a job that focuses on content moderation (e.g., illicit image detection, age restrictions, copyright violation) and runs as new images/videos are uploaded.</p><p>For an end-to-end content moderation solution, the job may ingest the unstructured data into a Delta Lake table and store additional columns, such as the auto-moderation status and tags.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7pei!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f15a03-53b9-4a89-8107-0f23f4093a4b_169x628.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7pei!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f15a03-53b9-4a89-8107-0f23f4093a4b_169x628.png 424w, https://substackcdn.com/image/fetch/$s_!7pei!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f15a03-53b9-4a89-8107-0f23f4093a4b_169x628.png 848w, 
https://substackcdn.com/image/fetch/$s_!7pei!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f15a03-53b9-4a89-8107-0f23f4093a4b_169x628.png 1272w, https://substackcdn.com/image/fetch/$s_!7pei!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f15a03-53b9-4a89-8107-0f23f4093a4b_169x628.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7pei!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f15a03-53b9-4a89-8107-0f23f4093a4b_169x628.png" width="169" height="628" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25f15a03-53b9-4a89-8107-0f23f4093a4b_169x628.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:628,&quot;width&quot;:169,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Pasted image 20241028140114.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Pasted image 20241028140114.png" title="Pasted image 20241028140114.png" srcset="https://substackcdn.com/image/fetch/$s_!7pei!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f15a03-53b9-4a89-8107-0f23f4093a4b_169x628.png 424w, https://substackcdn.com/image/fetch/$s_!7pei!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f15a03-53b9-4a89-8107-0f23f4093a4b_169x628.png 848w, 
https://substackcdn.com/image/fetch/$s_!7pei!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f15a03-53b9-4a89-8107-0f23f4093a4b_169x628.png 1272w, https://substackcdn.com/image/fetch/$s_!7pei!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f15a03-53b9-4a89-8107-0f23f4093a4b_169x628.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>4. 
CCPA Requests</h2><p>The California Consumer Privacy Act (CCPA) gives consumers rights over their personal data, such as requesting that it be removed from companies' products or deleted altogether.</p><p>Similar to #2, a good solution is to use file arrival triggers to process manifest files encapsulating the CCPA requests. We can create a Databricks Job that processes each request file, updating or deleting rows in tables where they match the requested email address or other PII.</p><blockquote><p><strong>Bonus Topic:</strong> We can make this even easier by using table and column <em>tags</em> in Unity Catalog to tag data containing PII. The tagged tables and columns can then be found programmatically by querying the <a href="https://docs.databricks.com/en/sql/language-manual/sql-ref-information-schema.html">information_schema</a>.</p></blockquote><p></p><h2>Conclusion</h2><p>We barely scratched the surface of what's possible with File Arrival Triggers. I encourage you to try them out, and I hope Databricks keeps adding practical features like this to Workflows.</p><p>I love seeing solutions across industries--let me know in the comments if you have another use case for file arrival triggers!</p>]]></content:encoded></item><item><title><![CDATA[Practical Terraform: You're Doing it Wrong (Part 2)]]></title><description><![CDATA[This is Part 2 of the set of practical Terraform tips. 
Supercharge your Infrastructure-as-Code to be more resilient, easier to maintain, and idiomatic.]]></description><link>https://www.makewithdata.tech/p/practical-terraform-youre-doing-it-b97</link><guid isPermaLink="false">https://www.makewithdata.tech/p/practical-terraform-youre-doing-it-b97</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Tue, 15 Oct 2024 04:30:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!hrB0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e59c5b-20f3-47e6-a727-f39ddb66300e_800x800.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hrB0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e59c5b-20f3-47e6-a727-f39ddb66300e_800x800.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hrB0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e59c5b-20f3-47e6-a727-f39ddb66300e_800x800.jpeg 424w, https://substackcdn.com/image/fetch/$s_!hrB0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e59c5b-20f3-47e6-a727-f39ddb66300e_800x800.jpeg 848w, https://substackcdn.com/image/fetch/$s_!hrB0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e59c5b-20f3-47e6-a727-f39ddb66300e_800x800.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!hrB0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e59c5b-20f3-47e6-a727-f39ddb66300e_800x800.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hrB0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e59c5b-20f3-47e6-a727-f39ddb66300e_800x800.jpeg" width="800" height="800" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3e59c5b-20f3-47e6-a727-f39ddb66300e_800x800.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:337306,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hrB0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e59c5b-20f3-47e6-a727-f39ddb66300e_800x800.jpeg 424w, https://substackcdn.com/image/fetch/$s_!hrB0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e59c5b-20f3-47e6-a727-f39ddb66300e_800x800.jpeg 848w, https://substackcdn.com/image/fetch/$s_!hrB0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e59c5b-20f3-47e6-a727-f39ddb66300e_800x800.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!hrB0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e59c5b-20f3-47e6-a727-f39ddb66300e_800x800.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>This is Part 2 of the set of practical Terraform tips. 
If you missed Part 1, check it out here: </p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:149543235,&quot;url&quot;:&quot;https://makewithdata.substack.com/p/practical-terraform-youre-doing-it&quot;,&quot;publication_id&quot;:3061796,&quot;publication_name&quot;:&quot;MakeWithData&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70751bc-402f-4ebe-bd35-7cd5e8239d0c_793x793.png&quot;,&quot;title&quot;:&quot;Practical Terraform: You're Doing it Wrong&quot;,&quot;truncated_body_text&quot;:&quot;We've all written Terraform IaC that we're not proud of before--it happens. I'm here today to talk about the Terraform you write that you think you're proud of...until it outgrows your team, becomes hard to manage, and terrifies you anytime you terraform apply&quot;,&quot;date&quot;:&quot;2024-10-14T12:03:06.349Z&quot;,&quot;like_count&quot;:0,&quot;comment_count&quot;:0,&quot;bylines&quot;:[{&quot;id&quot;:190683435,&quot;name&quot;:&quot;Zach King&quot;,&quot;handle&quot;:&quot;makewithdata&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63a71c63-8f9b-44a9-a986-49bd85cdf4ea_1024x1024.jpeg&quot;,&quot;bio&quot;:null,&quot;profile_set_up_at&quot;:&quot;2024-09-22T20:22:04.194Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:3115715,&quot;user_id&quot;:190683435,&quot;publication_id&quot;:3061796,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:3061796,&quot;name&quot;:&quot;MakeWithData&quot;,&quot;subdomain&quot;:&quot;makewithdata&quot;,&quot;custom_domain&quot;:&quot;makewithdata.tech&quot;,&quot;custom_domain_optional&quot;:true,&quot;hero_text&quot;:&quot;MakeWithData is where I share content to all data + AI practitioners. 
Whether you come from a data analyst, data engineer, data scientist, or business background, you'll find content that resonates with modern issues and new trends in the industry.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e70751bc-402f-4ebe-bd35-7cd5e8239d0c_793x793.png&quot;,&quot;author_id&quot;:190683435,&quot;theme_var_background_pop&quot;:&quot;#FF6719&quot;,&quot;created_at&quot;:&quot;2024-09-22T20:22:35.418Z&quot;,&quot;rss_website_url&quot;:null,&quot;email_from_name&quot;:null,&quot;copyright&quot;:&quot;Zach King&quot;,&quot;founding_plan_name&quot;:&quot;Founding Member&quot;,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;is_personal_mode&quot;:false}}],&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:false,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://makewithdata.substack.com/p/practical-terraform-youre-doing-it?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!hQXU!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70751bc-402f-4ebe-bd35-7cd5e8239d0c_793x793.png"><span class="embedded-post-publication-name">MakeWithData</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">Practical Terraform: You're Doing it Wrong</div></div><div class="embedded-post-body">We've all written Terraform IaC that we're not proud of before--it happens. 
I'm here today to talk about the Terraform you write that you think you're proud of...until it outgrows your team, becomes hard to manage, and terrifies you anytime you terraform apply&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">2 years ago &#183; Zach King</div></a></div><p>We'll explore several more Terraform pitfalls and how to avoid them so your infra teams can succeed long-term!</p><h2>1. Too many conditional resources</h2><p>Have you ever seen a module that creates too many conditional resources using syntax like <code>for_each</code> and <code>count</code> and ternary operators? These are less readable and generate a lot of complexity for maintainers.</p><p>For example:</p><pre><code><code># file: modules/storage/main.tf

variable "create_s3_bucket" {
  type        = bool
  description = "Whether to create an S3 bucket as well"
  default     = false
}

resource "aws_s3_bucket" "this" {
  count = var.create_s3_bucket ? 1 : 0
  ...
}
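
# A minimal sketch, not part of the original module: exposing the conditional
# bucket to callers. The output name "s3_bucket_id" is hypothetical. Assuming
# Terraform 0.15 or later, one() returns null when count is 0, so callers do
# not need the ternary-plus-index pattern.
output "s3_bucket_id" {
  value = one(aws_s3_bucket.this[*].id)
}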

# Other storage resources... EFS volumes, Backup configuration, etc.
...</code></code></pre><p>Used in moderation, this can be a fine way to add flexibility to your module. However, take it too far, and you'll have a Frankenstein of faux-array references like <code>aws_s3_bucket.this[0]</code> or worse: <code>var.create_s3_bucket ? aws_s3_bucket.this[0].id : ""</code>. It also makes your configuration more difficult to read and reason about.</p><p>When it becomes too much, it is better to decouple these conditional components into a separate module. Instead of using ternary operators, you either instantiate the module or you don't.</p><h2>2. Resources that don't belong in a Module</h2><p>We will only travel a short distance from the previous example. Let's say we have the above module with the conditionally-created S3 bucket. We may also have a few folders for deploying to Dev, Test, and Prod environments:</p><pre><code><code># file: deployments/dev/main.tf

module "storage" {
  source           = "../../modules/storage"
  create_s3_bucket = true
  env              = "dev"
}</code></code></pre><pre><code><code># file: deployments/test/main.tf

module "storage" {
  source           = "../../modules/storage"
  create_s3_bucket = false
}</code></code></pre><pre><code><code># file: deployments/prod/main.tf

module "storage" {
  source           = "../../modules/storage"
  create_s3_bucket = false
}</code></code></pre><p>In this scenario, we create the S3 bucket in only one environment. This is typical of infrastructure resources unique to one environment's purposes, such as developer sandboxes, QA tooling, and ad-hoc troubleshooting devices.</p><p>Using conditional resources with variables is the wrong way to solve this use case; if a resource is specific to one environment, you should only create the resource <em>in that environment's deployment</em>. Either of the following refactors would work in this example:</p><ol><li><p>Create a new module named <code>s3_storage</code> and move the S3 bucket and related resources inside. Remove the variable and <code>count</code> syntax, then instantiate the module only in the dev deployment.</p></li><li><p>Don't use a module for the S3 bucket and related resources. Simply put them in the <code>deployments/dev/</code> folder directly.</p></li></ol><h2>3. Sandbox Environments</h2><p>This next tip is a godsend for parallelization and will also increase your confidence in your ability to spin up <em>and</em> tear down the infrastructure.</p><p>We often write Terraform with our environments in mind, like Dev, Test, Stage, and Prod. We create resources with the environment keyword in the resource name, such as naming a Lambda function <code>sqs-ingestor-${var.env}-${var.region}</code>, which resolves to <code>sqs-ingestor-dev-us-east-1</code> in Dev. This may be fine for your team, especially on a smaller scale and when each of you works alone; however, what do you do when your colleague needs to test their version of the Lambda function while you're testing your feature branch?</p><p>This is where Terraform <a href="https://developer.hashicorp.com/terraform/language/state/workspaces">Workspaces</a> come in very handy. Say the two developers are me and Sara Hollis, working on different new features simultaneously. 
If we plan the Terraform with parallel work streams in mind, we can build the Lambda function's name from the built-in <code>${terraform.workspace}</code> value, which contains the name of the current workspace.</p><pre><code><code>resource "aws_lambda_function" "sqs_ingestor" {
  function_name = "sqs-ingestor-${terraform.workspace}"
  # role, runtime, handler, and the deployment package are omitted for brevity

  tags = {
    Name = "sqs-ingestor-${terraform.workspace}"
  }
}</code></code></pre><p>Then I proceed with my work:</p><pre><code><code>zcking&gt; git checkout feature/new-redundancy-options
zcking&gt; terraform init
zcking&gt; terraform workspace new dev-zcking
zcking&gt; terraform apply</code></code></pre><p>While Sara does the same for her work:</p><pre><code><code>shollis&gt; git checkout feature/json-schema-evolution
shollis&gt; terraform init
shollis&gt; terraform workspace new dev-shollis
shollis&gt; terraform apply</code></code></pre><p>Both applies succeed in parallel because each workspace keeps its own state, so two Lambda functions will exist afterward. Terraform calls these <em>workspaces</em>, but I also like to refer to them as <em>sandboxes</em> because we each have our own isolated environment to play in, like a sandbox.</p><blockquote><p><strong>Note:</strong> This is usually feasible, but every use case is different. Sometimes, you may prefer to keep certain resources global or shared, such as ECS/EKS clusters, databases, and others, usually for cost and data reasons.</p></blockquote><h2>4. Remote States</h2><p>My final tip for creating a more practical Terraform configuration is to query remote state files. This is where you programmatically read the outputs stored in another remote Terraform state file, such as one from another environment or team. By querying a remote state file, you can reuse outputs and modularize your project further without requiring the resources to be managed in the same deployable scope of Terraform code.</p><p>For example, we may deploy core networking infrastructure like our VPC with one state file:</p><pre><code><code>terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_vpc" "main" { ... }
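
# Hypothetical addition (not in the original project): a subnet and a matching
# output, so downstream projects can place instances inside this VPC. The
# resource name and CIDR block are assumptions; the CIDR must fall within the
# VPC's own range.
resource "aws_subnet" "main" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

output "subnet_id" {
  value = aws_subnet.main.id
}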

output "vpc_id" { 
  value = aws_vpc.main.id
}</code></code></pre><p>Now in a separate Terraform project, we can query the remote state and access the output to deploy an EC2 instance into the VPC:</p><pre><code><code>data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    bucket = "my-terraform-state"
    key    = "vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "web" {
  ami           = "ami-12345678"
  instance_type = "t2.micro"
  vpc_security_group_ids = [aws_security_group.web_sg.id]
  subnet_id     = data.terraform_remote_state.vpc.outputs.subnet_id # assumes the VPC project also exports a subnet_id output
}</code></code></pre><p>You may wonder, "Couldn't I just use a data query to the aws_vpc resource rather than interrogate the Terraform state?" Technically, yes, and in your case, you may prefer that simplicity.</p><p>The tradeoff is that when using a traditional data query, you will need to include filters based on resource ID, tags, or other attributes; furthermore, some resources do not offer a data query to look up the infrastructure. Querying a remote state defers to the source of truth&#8212;the Terraform that deployed the dependency resources, like the VPC in this case.</p><h2>Conclusion</h2><p>I hope you enjoyed this expansion of my tips for practical Terraform-ing! </p><p>Follow for more content like this!</p>]]></content:encoded></item><item><title><![CDATA[Practical Terraform: You're Doing it Wrong (Part 1)]]></title><description><![CDATA[We've all written Terraform IaC that we're not proud of before&#8202;-&#8202;it happens. I'm here today to talk about the Terraform you write that you think you're proud of&#8230;until it outgrows your team, becomes hard to manage, and terrifies you anytime you terraform apply.]]></description><link>https://www.makewithdata.tech/p/practical-terraform-youre-doing-it</link><guid isPermaLink="false">https://www.makewithdata.tech/p/practical-terraform-youre-doing-it</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Mon, 14 Oct 2024 12:03:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!NgcD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6292a7-525b-438a-92de-3d4199bfa966_1024x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!NgcD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6292a7-525b-438a-92de-3d4199bfa966_1024x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NgcD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6292a7-525b-438a-92de-3d4199bfa966_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NgcD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6292a7-525b-438a-92de-3d4199bfa966_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NgcD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6292a7-525b-438a-92de-3d4199bfa966_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NgcD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6292a7-525b-438a-92de-3d4199bfa966_1024x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NgcD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6292a7-525b-438a-92de-3d4199bfa966_1024x1024.jpeg" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d6292a7-525b-438a-92de-3d4199bfa966_1024x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:189644,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NgcD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6292a7-525b-438a-92de-3d4199bfa966_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NgcD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6292a7-525b-438a-92de-3d4199bfa966_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NgcD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6292a7-525b-438a-92de-3d4199bfa966_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NgcD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6292a7-525b-438a-92de-3d4199bfa966_1024x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p></p><p>We've all written Terraform IaC that we're not proud of before--it happens. I'm here today to talk about the Terraform you write that you <em>think</em> you're proud of...until it outgrows your team, becomes hard to manage, and terrifies you anytime you <code>terraform apply</code>.</p><h2>1. Monolithic Modules</h2><p>A Terraform module is a reusable set of code, similar to a class in object-oriented programming. I like to say that the module is the "cookie cutter"; anytime you use it, you are "instantiating" it.</p><p>When you create a module, try to keep it short and sweet. Users of your module don't want to worry about a heap of baggage, like 50+ variables, 4+ providers, and 100+ direct resources, spat out in the first plan.</p><p>Take this module for a web application, for example:</p><pre><code><code>webapp_bad
&#9500;&#9472;&#9472; api_gateway.tf
&#9500;&#9472;&#9472; cloudfront.tf
&#9500;&#9472;&#9472; cloudwatch.tf
&#9500;&#9472;&#9472; cognito.tf
&#9500;&#9472;&#9472; iam.tf
&#9500;&#9472;&#9472; lambda.tf
&#9500;&#9472;&#9472; load_balancer.tf
&#9500;&#9472;&#9472; outputs.tf
&#9500;&#9472;&#9472; postgres.tf
&#9500;&#9472;&#9472; providers.tf
&#9500;&#9472;&#9472; variables.tf
&#9492;&#9472;&#9472; vpc.tf</code></code></pre><p>This is a typical stack for a web application on AWS, but I'll let your imagination fill in the myriad of resources. It's a lot. The mistake here is that we've created a monolithic module that only serves the purpose of 1 specific application stack.</p><p>A better approach would be to decouple these into smaller, more reusable units:</p><pre><code><code>webapp_good/
&#9500;&#9472;&#9472; backend/...
&#9500;&#9472;&#9472; frontend/...
&#9492;&#9472;&#9472; network/...</code></code></pre><h2>2. Useless Modules</h2><p>Next on our pitfall hit list are the modules that simply aren't needed. If you find yourself writing a module with variables for almost every argument defined on the resources, ask yourself: Am I accomplishing something novel or simply repeating what's already easily handled by the direct resource(s) I'm creating? Likewise, we don't want a module containing only 1-2 resources, as that could be achieved easily on our own.</p><p>Remember, modules should accomplish one or more of the following tasks:</p><ol><li><p>Automate standard configuration or resource best practices (e.g., tagging, security policies, naming conventions, etc.).</p></li><li><p>Organize configuration into reusable blocks of code, with some flexibility but not so much that we re-implement the individual resources.</p></li><li><p>Break down a complex infrastructure solution into smaller units that are easier to maintain.</p></li></ol><p>For example, let's try not to write useless modules like this s3 bucket module:</p><pre><code><code># file: useless_s3/main.tf

variable "bucket" {
  type = string
}

variable "acl" {
  type = string
}

resource "aws_s3_bucket" "this" {
  bucket = var.bucket
  acl    = var.acl
}</code></code></pre><pre><code><code># file: main.tf

module "useless_s3_bucket" {
  source = "./modules/useless_s3"
  
  bucket = "my-useless-bucket"
  acl    = "private"
}</code></code></pre><p>This can be achieved without a module. The module here only adds another layer of nesting, and it will surely frustrate your team when they have to traverse the module tree just to add a simple argument to the underlying <code>aws_s3_bucket</code> resource.</p><h2>3. Folder Layout &amp; Blast Radius</h2><p>Next, let's talk about file structure when writing Terraform. This is crucial when scaffolding your Infrastructure-as-Code configuration for a brand-new project. You can be a whiz at HCL syntax and functions but still create inefficiencies and undue complexity if your folders are not laid out carefully.</p><p>Try to think about folders in terms of two types: <strong>modules</strong> and <strong>deployments</strong>. Modules are where you define reusable Terraform code, and they should represent components of your infrastructure, like ingredients in a recipe. I typically refer to the folder where you run <code>terraform apply</code> as the deployment scope.</p><p>A common mistake is to create a monolithic module that defines the entire infrastructure solution, parameterized with variables, and then create multiple deployments of this mono-module, such as a dev, test, and prod deployment:</p><pre><code><code>&#9500;&#9472;&#9472; deployments
&#9474;&nbsp;&nbsp; &#9500;&#9472;&#9472; dev/
&#9474;&nbsp;&nbsp; &#9500;&#9472;&#9472; test/
&#9474;&nbsp;&nbsp; &#9492;&#9472;&#9472; prod/
&#9500;&#9472;&#9472; modules
&#9474;&nbsp;&nbsp; &#9500;&#9472;&#9472; backend_app/
&#9474;&nbsp;&nbsp; &#9500;&#9472;&#9472; ecs/
&#9474;&nbsp;&nbsp; &#9492;&#9472;&#9472; frontend_app/</code></code></pre><p>This layout could be better because it creates a huge blast radius for each of the deployments/environments. Any change within <code>frontend_app</code> still requires <code>terraform apply</code> to refresh and re-evaluate every resource in the deployment, including those in the instance of the <code>ecs/</code> module.</p><p>Consider breaking your deployments into more than one deployable scope. Ask yourself the following questions:</p><ol><li><p>How often will these resources need to be changed/redeployed?</p></li><li><p>How critical are these resources? Are they more or less critical than other groups of resources?</p></li></ol><p>Critical resources such as the VPC and ECS/EKS clusters are good candidates to separate into their own deployable scope/folder; they do not change very often and are vital to the operation of application services. This gives you a safer, leaner, more agile Terraform project structure, like so:</p><pre><code><code>&#9500;&#9472;&#9472; deployments
&#9474;&nbsp;&nbsp; &#9500;&#9472;&#9472; dev
&#9474;&nbsp;&nbsp; &#9474;&nbsp;&nbsp; &#9500;&#9472;&#9472; app          &lt;-- Instantiates backend_app/ and frontend_app/
&#9474;&nbsp;&nbsp; &#9474;&nbsp;&nbsp; &#9492;&#9472;&#9472; ecs_cluster  &lt;-- Instantiates ecs/ and VPC resources
&#9474;&nbsp;&nbsp; &#9500;&#9472;&#9472; prod
&#9474;&nbsp;&nbsp; &#9474;&nbsp;&nbsp; &#9500;&#9472;&#9472; app
&#9474;&nbsp;&nbsp; &#9474;&nbsp;&nbsp; &#9492;&#9472;&#9472; ecs_cluster
&#9474;&nbsp;&nbsp; &#9492;&#9472;&#9472; test
&#9474;&nbsp;&nbsp;     &#9500;&#9472;&#9472; app
&#9474;&nbsp;&nbsp;     &#9492;&#9472;&#9472; ecs_cluster
&#9492;&#9472;&#9472; modules
    &#9500;&#9472;&#9472; backend_app
    &#9500;&#9472;&#9472; ecs
    &#9492;&#9472;&#9472; frontend_app</code></code></pre><h2>4. Version Constraints</h2><p>Terraform supports version constraints for the <code>required_providers {}</code> as well as the version of Terraform itself. We'll focus on providers for now. Let's examine how you might use version constraints with increasingly better examples.</p><h3>Worst</h3><pre><code><code>terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}</code></code></pre><p>The worst way to use version constraints is not to use them at all.</p><h3>Bad</h3><pre><code><code>terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.71.0"
    }
  }
}</code></code></pre><p>I might spark some controversy by saying this, but yes, I think this is bad. Terraform providers like AWS, Azure, GCP, and others release new versions constantly, adding new features and fixing bugs.</p><p>By pinning your provider to a specific version, you will likely find yourself combing through dozens of Terraform modules every now and then to update every single version constraint. There's no forgiveness here--every module in a configuration must agree on the exact same version of the provider--making it a colossal pain to perform upgrades or receive new features/fixes.</p><h3>Better</h3><pre><code><code>terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~&gt; 5.71.0"
    }
  }
}</code></code></pre><p>This is better because it uses the <em>pessimistic constraint operator</em> <code>~&gt;</code>, which in this example will allow the last version number (the patch version) to change, but not the major or minor version numbers.</p><p>It allows updates only to patch versions of <code>5.71.x</code>, meaning only versions that are <code>&gt;= 5.71.0</code> and <code>&lt; 5.72.0</code>. Again, we will have to painstakingly update tons of these lines in our modules when we inevitably need to upgrade versions.</p><h3>Best</h3><pre><code><code>terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "&gt;= 5.71.0, &lt; 6.0.0"
    }
  }
}</code></code></pre><p>Actually, the best can be written with <code>~&gt;</code> as well:</p><pre><code><code>terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~&gt; 5.71"
    }
  }
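
  # Optional, and an assumption for illustration (the 1.5.0 floor is only a
  # placeholder): the Terraform CLI version itself can be constrained the
  # same way, alongside the provider constraint above.
  required_version = "&gt;= 1.5.0, &lt; 2.0.0"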
}</code></code></pre><p>This is great because it allows the minor and patch version numbers to slide forward but not the major version. Most reputable providers, such as AWS, follow Semantic Versioning (SemVer) closely, and they typically do not introduce breaking changes until a new <em>major</em> version is released.</p><p>It's your choice to use <code>~&gt;</code> or the more verbose <code>&gt;= x.y.z, &lt; a.b.c</code>, but I've had people ask about the mysterious <code>~&gt;</code> syntax enough times that I favor simplicity--everyone looks at <code>&gt;= 5.71.0, &lt; 6.0.0</code> and understands it quite easily.</p><p>By using this constraint we ensure that the configuration can gracefully receive new features/fixes (minor and patch version upgrades) without breaking changes. To upgrade the provider version within this constraint, simply run <code>terraform init -upgrade</code>, which selects the newest version the constraint allows and updates the dependency lock file (<code>.terraform.lock.hcl</code>).</p><h2>Conclusion</h2><p>Let's wrap up. Terraform is powerful, but as Uncle Ben famously said, with great power comes great responsibility.</p><p>We can make our lives, and our coworkers' lives, a lot easier by applying the principles from this guide to the Terraform we write. Consider those pain points the next time you find yourself modifying old Terraform code or doing a deployment and dreading the large, scary Terraform plan. I hope you will think back to these tips and find them useful!</p><blockquote><p><em>10/14/2024: <strong>Part 2</strong> is out now! 
Get even more, better, practical Terraform tips with Part 2 <a href="https://open.substack.com/pub/makewithdata/p/practical-terraform-youre-doing-it-b97?r=35j0a3&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">here</a></em>!</p></blockquote><p>Follow for more content like this!</p>]]></content:encoded></item><item><title><![CDATA[Hacking My Traeger Grill]]></title><description><![CDATA[Today I'll show you how I was able to hack my Traeger grill to capture data such as the grill temperature, pellet level, probe goal, and more, as well as feed that data to a database and visualize with a Grafana dashboard all within my own homelab.]]></description><link>https://www.makewithdata.tech/p/hacking-my-traeger-grill</link><guid isPermaLink="false">https://www.makewithdata.tech/p/hacking-my-traeger-grill</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Tue, 08 Oct 2024 05:53:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319622a9-aa23-4962-b9de-9a6f5db9c7e6_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!83gB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319622a9-aa23-4962-b9de-9a6f5db9c7e6_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!83gB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319622a9-aa23-4962-b9de-9a6f5db9c7e6_1024x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!83gB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319622a9-aa23-4962-b9de-9a6f5db9c7e6_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!83gB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319622a9-aa23-4962-b9de-9a6f5db9c7e6_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!83gB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319622a9-aa23-4962-b9de-9a6f5db9c7e6_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!83gB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319622a9-aa23-4962-b9de-9a6f5db9c7e6_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/319622a9-aa23-4962-b9de-9a6f5db9c7e6_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1572992,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!83gB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319622a9-aa23-4962-b9de-9a6f5db9c7e6_1024x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!83gB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319622a9-aa23-4962-b9de-9a6f5db9c7e6_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!83gB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319622a9-aa23-4962-b9de-9a6f5db9c7e6_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!83gB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319622a9-aa23-4962-b9de-9a6f5db9c7e6_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As a southerner, I love BBQ and smoking meats, and the Traeger is no muss no fuss. However, I'm also a nerd that loves doing things in my homelab. It started out early one morning after putting a beautiful pork butt on the smoker for some pulled pork. I went back in the house and was trying to think of something to do while it got started... and thus was born the idea to see what I could make with data from my Traeger Ironwood XL.</p><p>Today I'll show you how I was able to hack my Traeger grill to capture data such as the grill temperature, pellet level, probe goal, and more, as well as feed that data to a database and visualize with a Grafana dashboard all within my own homelab.</p><h1><strong>1. Information Gathering / Network Scanning</strong></h1><p>The first step in any hacking exercise is to gather as much information as possible to understand the target and what makes it tick. I'll break this down into 2 simple goals initially:</p><ol><li><p>Identify the grill's IP address</p></li><li><p>Scan the grill for open ports and software versions</p></li></ol><p>I was able to find the IP address of my grill easily by visiting my router, or gateway, web UI (commonly located at <a href="http://192.168.1.1">http://192.168.1.1</a>) and viewing the attached devices, which listed the Traeger grill quite obviously: <strong>192.168.1.19</strong>. 
Nice, moving on.</p><p>Next, I used <code>nmap</code> to scan for open ports and version info:</p><pre><code><code>nmap -sV -p 1-9999 192.168.1.19</code></code></pre><p>I didn't find anything open :(</p><p>After some searching online about Traeger's data protocols, I found a <a href="https://aws.amazon.com/partners/success/traeger-grills-ost/">customer success story</a> on AWS about how Traeger migrated to AWS IoT, which uses <a href="http://mqtt.org/">MQTT</a> (Message Queuing Telemetry Transport).</p><h1><strong>2. Intercepting Data from the Grill</strong></h1><p>If the grill talks to AWS, that would explain why I couldn't find any open ports on it: the grill makes outbound connections to the cloud rather than listening for inbound ones. I often use the Traeger mobile app to monitor and control my smoker while cooking, so the mobile app is most likely talking to a server in AWS, and so is the grill.</p><p>I wanted to test this theory using a Man-in-the-Middle (MITM) proxy. Essentially, we run a proxy to capture all HTTP(S) traffic and then configure our device (mobile phone or PC) to use the proxy. You could do this by running <a href="https://docs.mitmproxy.org/stable/">mitmproxy</a> on one machine and then configuring the mobile phone to use it.</p><p>I noticed there was also a desktop Traeger app for macOS, so I opted to try that so I could stay on one device for my testing. I installed the Traeger app for macOS, fired up <a href="https://www.wireshark.org/">Wireshark</a>, and began capturing packets on my Mac's wireless network interface, <code>en0</code>. While that was running, I signed in on the Traeger app and clicked around a few times to give it enough chances to fire off requests, then stopped the packet capture in Wireshark. 
With some luck you can find DNS requests like these:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nmVB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb962dfba-66ba-459b-a535-3da09a9a698a_3370x570.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nmVB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb962dfba-66ba-459b-a535-3da09a9a698a_3370x570.png 424w, https://substackcdn.com/image/fetch/$s_!nmVB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb962dfba-66ba-459b-a535-3da09a9a698a_3370x570.png 848w, https://substackcdn.com/image/fetch/$s_!nmVB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb962dfba-66ba-459b-a535-3da09a9a698a_3370x570.png 1272w, https://substackcdn.com/image/fetch/$s_!nmVB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb962dfba-66ba-459b-a535-3da09a9a698a_3370x570.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nmVB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb962dfba-66ba-459b-a535-3da09a9a698a_3370x570.png" width="1456" height="246" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b962dfba-66ba-459b-a535-3da09a9a698a_3370x570.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:246,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:349819,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nmVB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb962dfba-66ba-459b-a535-3da09a9a698a_3370x570.png 424w, https://substackcdn.com/image/fetch/$s_!nmVB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb962dfba-66ba-459b-a535-3da09a9a698a_3370x570.png 848w, https://substackcdn.com/image/fetch/$s_!nmVB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb962dfba-66ba-459b-a535-3da09a9a698a_3370x570.png 1272w, https://substackcdn.com/image/fetch/$s_!nmVB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb962dfba-66ba-459b-a535-3da09a9a698a_3370x570.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The app resolves the DNS record <code>mobile-api.iot.traegergrills.io </code>which is a CNAME record resolving to <code>1ywgyc65d1.execute-api.us-west-2.amazonaws.com</code>. 
This appears to be an Amazon API Gateway endpoint, based on the <code>execute-api</code> format of the target domain name.</p><p>With some more sniffing, I discovered the following pieces of information:</p><ol><li><p>The app makes API calls to an API Gateway endpoint, <code>mobile-api.iot.traegergrills.io</code>.</p></li><li><p>The API calls carry a Bearer authentication token in the <code>Authorization</code> header.</p></li><li><p>Decoding that auth token with <a href="https://jwt.io">https://jwt.io</a> suggests it was issued by Amazon Cognito, because the token payload has <code>"iss": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_..."</code> in it.</p></li><li><p>There are API requests to endpoints such as <code>/prod/mqtt-connections</code>, confirming the app does use MQTT.</p></li></ol><p>This all makes sense if the folks over at Traeger are using AWS and the AWS IoT service.</p><h1><strong>3. Calling the APIs</strong></h1><p>Now that we know how the app works, we should be able to write a script that emulates the app's behavior and therefore accesses data from our grill programmatically.</p><p>Fortunately, I found that someone else had already done this and written a Python script to integrate Traeger with Home Assistant: <a href="https://github.com/sebirdman/traeger/blob/master/traeger.py">https://github.com/sebirdman/traeger/blob/master/traeger.py</a></p><p>With some minor tweaks I was able to reuse this code and see data pretty quickly:</p><pre><code>&#10095; python -i traeger.py
&gt;&gt;&gt; t = traeger('MY_EMAIL@EXAMPLE.COM', 'CHANGEME_PASSWORD', requests)
&gt;&gt;&gt; t.get_user_data()
{'userId': 'REDACTED', 'givenName': 'Zach', 'familyName': 'King', 'fullName': 'Zach King', 'email': 'REDACTED', 'username': 'REDACTED', 'cognito': 'REDACTED', 'urbanAirshipId': 'REDACTED', 'teams': [{'teamId': 'REDACTED', 'teamName': 'REDACTED', 'thingName': 'REDACTED', 'userId': '3d8477eb-0605-4f47-a8fc-44580b9de96d'}], 'things': [{'thingName': 'REDACTED', 'friendlyName': 'Sir-Ribs-A-Lot', 'deviceTypeId': '2205', 'userId': '3d8477eb-0605-4f47-a8fc-44580b9de96d', 'status': 'CONFIRMED', 'productId': '53059272-02e0-4c00-914e-83f0e65edb16'}]}
&gt;&gt;&gt;</code></pre><p>Then subscribe to updates from the MQTT broker:</p><pre><code>&gt;&gt;&gt; t.refresh_mqtt_url()
&gt;&gt;&gt; t.subscribe_to_grill_status()
&gt;&gt;&gt; t.grill_status
{'REDACTED': {'thingName': 'REDACTED', 'jobs': [{'jobId': 'Yos_Combined_IronwoodXL_REDACTED', 'thingName': 'REDACTED', 'currentStatus': {'status': 'REJECTED', 'timestamp': 1710372539}, 'isFinished': True, 'isInProgress': False}], 'status': {'acc': [], 'ambient': 99, 'connected': False, 'cook_id': '', 'cook_timer_complete': 0, 'cook_timer_end': 0, 'cook_timer_start': 0, 'current_cycle': 0, 'current_step': 0, 'errors': 0, 'grease_level': 0, 'grease_temperature': 0, 'grill': 147, 'grill_mode': 0, 'in_custom': 0, 'keepwarm': 0, 'pellet_level': 25, 'real_time': 0, 'seasoned': 1, 'server_status': 1, 'set': 325, 'smoke': 0, 'sys_timer_complete': 1, 'sys_timer_end': 0, 'sys_timer_start': 0, 'system_status': 99, 'time': 1728263173, 'ui': {'ambient_light': 0, 'screen_brightness': 0}, 'units': 1, 'uuid': 'FC0FE708F7DF', 'probe_con': 0, 'probe': 0, 'probe_set': 0, 'probe_alarm_fired': 0}, 'features': {'cold_smoke_enabled': 0, 'flame_sensor_enabled': 1, 'grease_sensor_enabled': 0, 'grill_light_enabled': 1, 'grill_mode_enabled': 0, 'lid_sensor_enabled': 1, 'limits': {'max_grill_temp': 500}, 'open_loop_mode_enabled': 0, 'pellet_sensor_connected': 1, 'pellet_sensor_enabled': 1, 'pizza_mode_enabled': 0, 'super_smoke_enabled': 1, 'ui': {'ui_type': 0}}, 'limits': {'max_grill_temp': 0}, 'settings': {'config_version': '2205.001', 'device_type_id': 2205, 'feature': 0, 'fw_build_num': 'c0321c9d-20230221_113102', 'fw_version': '01.03.21', 'language': 0, 'networking_fw_version': '1.4.2', 'rssi': -68, 'speaker': 1, 'ssid': 'REDACTED', 'ui_fw_build_num': 'e4ea0a2-20220715_205356', 'ui_fw_version': '01.03.12', 'units': 1}, 'usage': {'ac_ignitor': 10740, 'auger': 231583, 'cook_cycles': 18, 'dc_ignitor': 0, 'error_stats': {'auger_disco': 0, 'auger_ovrcur': 0, 'bad_thermocouple': 0, 'fan_disco': 0, 'ign_ac_disco': 0, 'ign_dc_disco': 0, 'ignite_fail': 0, 'low_ambient': 0, 'lowtemp': 0, 'overheat': 0}, 'fan': 329530, 'grease_trap_clean_countdown': 0, 'grill_clean_countdown': 0, 'hotrod': 0, 
'light': 8494, 'runtime': 329711, 'time': 0, 'ui': {'screen_on': 0}}, 'custom_cook': {'cook_cycles': [{'slot_num': 4, 'populated': 0}]}, 'details': {'thingName': 'REDACTED', 'userId': '3d8477eb-0605-4f47-a8fc-44580b9de96d', 'lastConnectedOn': 1728240177, 'thingNameLower': 'fc0fe708f7df', 'friendlyName': 'Sir-Ribs-A-Lot', 'deviceType': '2205'}, 'stateIndex': 33112, 'schemaVersion': '2.0'}}
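&gt;&gt;&gt; # Illustrative addition, not part of the original session: pull a few
&gt;&gt;&gt; # dashboard-relevant fields out of the first grill's status payload.
&gt;&gt;&gt; # The result shown assumes the payload is still the snapshot printed above.
&gt;&gt;&gt; status = next(iter(t.grill_status.values()))['status']
&gt;&gt;&gt; status['grill'], status['set'], status['pellet_level']
(147, 325, 25)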
&gt;&gt;&gt;</code></pre><h1><strong>4. Sending the Data to Elasticsearch</strong></h1><p>Now we need to persist the data so we can later do some visualizations in Grafana. I chose Elasticsearch so let's modify the <code>def __init__()</code> constructor to initialize a connection to the database:</p><pre><code>def __init__(self, username, password, request_library, es_host): 
&#9;# Initialize Elasticsearch connection 
&#9;self.es_client = Elasticsearch([es_host])
&#9;...</code></pre><p>Once the MQTT subscription is made, the <code>grill_message(&#8230;)</code> method is invoked any time an event comes in from the grill. That's the hook we'll write to the database from:</p><pre><code>def grill_message(self, client, userdata, message): 
&#9;if message.topic.startswith("prod/thing/update/"): 
&#9;&#9;grill_id = message.topic[len("prod/thing/update/"):] 
&#9;&#9;
&#9;&#9;# Parse the grill status from the message 
&#9;&#9;grill_data = json.loads(message.payload) 
&#9;&#9;self.grill_status[grill_id] = grill_data 
&#9;&#9;
&#9;&#9;# Add a timestamp for Elasticsearch (Grafana uses this for time-series queries) 
&#9;&#9;grill_data['timestamp'] = datetime.utcnow().isoformat() 
&#9;&#9;
&#9;&#9;# Write the parsed data to Elasticsearch 
&#9;&#9;try: 
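&#9;&#9;&#9;# Omit id= so each update is stored as a new document instead of 
&#9;&#9;&#9;# overwriting the previous one (Grafana needs the history for time series) 
&#9;&#9;&#9;self.es_client.index(index='grill_status', body=grill_data) 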
&#9;&#9;&#9;logging.info(f"Grill data for {grill_id} written to Elasticsearch successfully.") 
&#9;&#9;except Exception as e: 
&#9;&#9;&#9;logging.error(f"Failed to write grill data to Elasticsearch: {e}") 
&#9;&#9;
&#9;&#9;# Trigger any callbacks registered for this grill ID 
&#9;&#9;if grill_id in self.grill_callbacks: 
&#9;&#9;&#9;for callback in self.grill_callbacks[grill_id]: 
&#9;&#9;&#9;&#9;callback()</code></pre><h1><strong>5. Visualizing the Data</strong></h1><p>Finally, we can use our data from Elasticsearch as a data source in Grafana and create a dashboard to put it all together.</p><p>I haven't gotten to this quite yet, it's on my TODO list while migrating part of my homelab to a Kubernetes cluster running on mini-PCs (follow my newsletter and let me know if you're interested in a series on building that homelab!). However, the dashboard should contain some graphs such as grill temperature, probe 1 temperature, and probe 2 temperature. It should also contain a gauge indicating the remaining pellet level from 0-100%, and a few single stat indicators for Super Smoke mode, probe goals, and more depending on the specific grill features.</p><p>The following is a basic example to illustrate the idea (credit: user/natty_patty from the r/homelab subreddit):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JmCi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5d8abd-d7d0-4d7f-920c-73e022cfb3de_828x1321.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JmCi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5d8abd-d7d0-4d7f-920c-73e022cfb3de_828x1321.png 424w, https://substackcdn.com/image/fetch/$s_!JmCi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5d8abd-d7d0-4d7f-920c-73e022cfb3de_828x1321.png 848w, 
https://substackcdn.com/image/fetch/$s_!JmCi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5d8abd-d7d0-4d7f-920c-73e022cfb3de_828x1321.png 1272w, https://substackcdn.com/image/fetch/$s_!JmCi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5d8abd-d7d0-4d7f-920c-73e022cfb3de_828x1321.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JmCi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5d8abd-d7d0-4d7f-920c-73e022cfb3de_828x1321.png" width="828" height="1321" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e5d8abd-d7d0-4d7f-920c-73e022cfb3de_828x1321.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1321,&quot;width&quot;:828,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:319987,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JmCi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5d8abd-d7d0-4d7f-920c-73e022cfb3de_828x1321.png 424w, https://substackcdn.com/image/fetch/$s_!JmCi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5d8abd-d7d0-4d7f-920c-73e022cfb3de_828x1321.png 848w, 
https://substackcdn.com/image/fetch/$s_!JmCi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5d8abd-d7d0-4d7f-920c-73e022cfb3de_828x1321.png 1272w, https://substackcdn.com/image/fetch/$s_!JmCi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5d8abd-d7d0-4d7f-920c-73e022cfb3de_828x1321.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h1><strong>Conclusion</strong></h1><p>This was a fun dive into IoT apps and learning how yet another device around the house 
works. You may be surprised by what you find simply by scanning your network and investigating API calls.</p><p>I hope you found this useful or at least entertaining. Please consider sharing and following my newsletter if you like this content as it motivates me to write more! Let me know in the comments if you have an interesting IoT data story. What challenges did you face?</p>]]></content:encoded></item><item><title><![CDATA[Data Engineering: Essential Skills and Tools for 2025]]></title><description><![CDATA[Discover the essential data engineering skills and tools for 2025, including Python, SQL, Apache Spark, Kafka, Delta Lake, cloud platforms, real-time streaming, and automation.]]></description><link>https://www.makewithdata.tech/p/data-engineering-essential-skills-2025</link><guid isPermaLink="false">https://www.makewithdata.tech/p/data-engineering-essential-skills-2025</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Fri, 27 Sep 2024 02:49:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70751bc-402f-4ebe-bd35-7cd5e8239d0c_793x793.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data engineering is now crucial for all organizations, not just tech companies, as they seek to harness data and AI. A solid foundation in data engineering is key to unlocking that potential.</p><p>This guide covers the essential skills, tools, and technologies for becoming an effective data engineer. 
Whether you're just starting or looking to deepen your expertise, this is your roadmap to modern data engineering.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.makewithdata.tech/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading MakeWithData! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Programming Languages</h2><h3>SQL</h3><p>Python and SQL are must-haves for both aspiring and experienced data engineers. SQL, especially in its ANSI-compliant forms, is widely used due to its simplicity and expressiveness, making it a fundamental skill.</p><h3>Python</h3><p><a href="https://www.python.org/">Python</a> remains the top choice in the data and AI world. Its simple syntax and powerful built-in features make it ideal for scripting and data manipulation. With a vast array of libraries (e.g. pandas, PySpark, scikit-learn, PyTorch) for data analysis, machine learning, and big data, Python continues to lead the pack in data engineering and AI.</p><h3>Scala (Bonus)</h3><p><a href="https://www.scala-lang.org/">Scala</a> elegantly blends functional and object-oriented programming, and runs on the Java Virtual Machine (JVM). 
It is a great fit for large, robust enterprise projects and still has a major footprint in big data frameworks like Apache Spark, so don&#8217;t sleep on this programming language.</p><h2>Key Frameworks and Tools</h2><h3>Apache Spark</h3><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BjgK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff35d2648-476b-46f2-8d9d-696c69955edb_312x162.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BjgK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff35d2648-476b-46f2-8d9d-696c69955edb_312x162.png 424w, https://substackcdn.com/image/fetch/$s_!BjgK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff35d2648-476b-46f2-8d9d-696c69955edb_312x162.png 848w, https://substackcdn.com/image/fetch/$s_!BjgK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff35d2648-476b-46f2-8d9d-696c69955edb_312x162.png 1272w, https://substackcdn.com/image/fetch/$s_!BjgK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff35d2648-476b-46f2-8d9d-696c69955edb_312x162.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BjgK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff35d2648-476b-46f2-8d9d-696c69955edb_312x162.png" width="318" height="165.1153846153846" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f35d2648-476b-46f2-8d9d-696c69955edb_312x162.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:162,&quot;width&quot;:312,&quot;resizeWidth&quot;:318,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;File:Apache Spark logo.svg - Wikimedia Commons&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="File:Apache Spark logo.svg - Wikimedia Commons" title="File:Apache Spark logo.svg - Wikimedia Commons" srcset="https://substackcdn.com/image/fetch/$s_!BjgK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff35d2648-476b-46f2-8d9d-696c69955edb_312x162.png 424w, https://substackcdn.com/image/fetch/$s_!BjgK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff35d2648-476b-46f2-8d9d-696c69955edb_312x162.png 848w, https://substackcdn.com/image/fetch/$s_!BjgK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff35d2648-476b-46f2-8d9d-696c69955edb_312x162.png 1272w, https://substackcdn.com/image/fetch/$s_!BjgK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff35d2648-476b-46f2-8d9d-696c69955edb_312x162.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Source: https://spark.apache.org</figcaption></figure></div><p>First and most importantly as far as frameworks go there is <a href="https://spark.apache.org/">Apache Spark</a>. 
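Spark is an open-source analytics and big data processing engine that leverages a distributed computing architecture.</p><p>Spark supports the Python, Scala, Java, SQL, and R programming languages. You can deploy Spark applications on a variety of cloud computing services such as Amazon EMR, or deploy them yourself using Spark&#8217;s native Kubernetes support. Spark clusters can range from a single node to thousands of nodes, and can process data at petabyte scale.</p><pre><code>&gt;&gt;&gt; from pyspark.sql.functions import col, when

&gt;&gt;&gt; # Sample rows assumed for illustration; `spark` is the session the PySpark shell provides.
&gt;&gt;&gt; df = spark.createDataFrame(
    [("sue", 32), ("li", 3), ("bob", 75), ("heo", 13)],
    ["first_name", "age"],
)

&gt;&gt;&gt; # Bucket each row into a life stage based on age.
&gt;&gt;&gt; df1 = df.withColumn(
    "life_stage",
    when(col("age") &lt; 13, "child")
    .when(col("age").between(13, 19), "teenager")
    .otherwise("adult"),
)

&gt;&gt;&gt; # Average every numeric column (here only age) per life stage.
&gt;&gt;&gt; df1.groupBy("life_stage").avg().show()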
+----------+--------+
|life_stage|avg(age)|
+----------+--------+
|     adult|    53.5|
|     child|     3.0|
|  teenager|    13.0|
+----------+--------+</code></pre><blockquote><p><strong>Note: </strong>PySpark also interoperates well with pandas and Apache Arrow, and the former Koalas project now ships with Spark as the pandas API on Spark, making it a powerhouse for data science and machine learning with big data.</p></blockquote><h3>PyTorch</h3><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Cii3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ceaafd9-376b-451f-87e4-4c2dbed4e3f6_2560x635.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Cii3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ceaafd9-376b-451f-87e4-4c2dbed4e3f6_2560x635.png 424w, https://substackcdn.com/image/fetch/$s_!Cii3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ceaafd9-376b-451f-87e4-4c2dbed4e3f6_2560x635.png 848w, https://substackcdn.com/image/fetch/$s_!Cii3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ceaafd9-376b-451f-87e4-4c2dbed4e3f6_2560x635.png 1272w, https://substackcdn.com/image/fetch/$s_!Cii3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ceaafd9-376b-451f-87e4-4c2dbed4e3f6_2560x635.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Cii3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ceaafd9-376b-451f-87e4-4c2dbed4e3f6_2560x635.png" width="1456" height="361" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ceaafd9-376b-451f-87e4-4c2dbed4e3f6_2560x635.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:361,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;File:PyTorch logo black.svg - Wikipedia&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="File:PyTorch logo black.svg - Wikipedia" title="File:PyTorch logo black.svg - Wikipedia" srcset="https://substackcdn.com/image/fetch/$s_!Cii3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ceaafd9-376b-451f-87e4-4c2dbed4e3f6_2560x635.png 424w, https://substackcdn.com/image/fetch/$s_!Cii3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ceaafd9-376b-451f-87e4-4c2dbed4e3f6_2560x635.png 848w, https://substackcdn.com/image/fetch/$s_!Cii3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ceaafd9-376b-451f-87e4-4c2dbed4e3f6_2560x635.png 1272w, https://substackcdn.com/image/fetch/$s_!Cii3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ceaafd9-376b-451f-87e4-4c2dbed4e3f6_2560x635.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Source: https://pytorch.org</figcaption></figure></div><p><a href="https://pytorch.org/">PyTorch</a> is a leading deep learning framework, ideal for both research and production environments. 
PyTorch is extremely relevant for ML and data engineers working on AI-driven data pipelines, model training, and model inference. </p><p>Its support for dynamic computation graphs allows for efficient processing of complex data models, while libraries like <a href="https://pytorch.org/serve/">TorchServe</a> simplify deploying ML models. Don&#8217;t think ML frameworks like PyTorch are only for data scientists and ML engineers; data engineers will stand apart from the crowd when equipped with even a basic understanding of ML concepts and popular frameworks like PyTorch.</p><p><a href="https://www.learnpytorch.io/">Start learning PyTorch today with "Learn PyTorch for Deep Learning: Zero to Mastery" free online book</a></p><h3>Streams: Kafka, Kinesis</h3><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qm8x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cca964-3c2b-4217-9dbc-3585662921b5_225x225.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qm8x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cca964-3c2b-4217-9dbc-3585662921b5_225x225.png 424w, https://substackcdn.com/image/fetch/$s_!Qm8x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cca964-3c2b-4217-9dbc-3585662921b5_225x225.png 848w, https://substackcdn.com/image/fetch/$s_!Qm8x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cca964-3c2b-4217-9dbc-3585662921b5_225x225.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Qm8x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cca964-3c2b-4217-9dbc-3585662921b5_225x225.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Qm8x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cca964-3c2b-4217-9dbc-3585662921b5_225x225.png" width="225" height="225" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69cca964-3c2b-4217-9dbc-3585662921b5_225x225.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:225,&quot;width&quot;:225,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Apache Kafka&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Apache Kafka" title="Apache Kafka" srcset="https://substackcdn.com/image/fetch/$s_!Qm8x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cca964-3c2b-4217-9dbc-3585662921b5_225x225.png 424w, https://substackcdn.com/image/fetch/$s_!Qm8x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cca964-3c2b-4217-9dbc-3585662921b5_225x225.png 848w, https://substackcdn.com/image/fetch/$s_!Qm8x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cca964-3c2b-4217-9dbc-3585662921b5_225x225.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Qm8x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cca964-3c2b-4217-9dbc-3585662921b5_225x225.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Source: https://kafka.apache.org</figcaption></figure></div><p>Many business use cases demand data and updates in real time. Streaming platforms like <a href="https://kafka.apache.org/">Apache Kafka</a> and <a href="https://aws.amazon.com/kinesis/">Amazon Kinesis</a> are essential for modern data engineering. Kafka is widely used for its ability to handle high-throughput, real-time data streams in distributed systems, and it has the added appeal of being open source and cloud agnostic.</p><p>Amazon Kinesis is a fully managed streaming service with similar capabilities plus integrations into the AWS ecosystem. Both platforms are invaluable for low-latency data processing and event-driven applications, such as IoT, financial systems, and fraud detection, which makes them critical tools for data engineers navigating real-time data challenges.</p><blockquote><p><strong>Note: </strong>while streaming engines like Kafka aren&#8217;t going away anytime soon, they are often selected too hastily as the required tool for the job. With modern data formats (e.g. Delta Lake) and query engines (e.g. 
Spark), I would encourage you not to underestimate the capability of simpler micro-batch processing too.</p></blockquote><h3>Delta Lake</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zgpc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5054bd63-df74-4057-8028-588acddd3f9f_370x302.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zgpc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5054bd63-df74-4057-8028-588acddd3f9f_370x302.png 424w, https://substackcdn.com/image/fetch/$s_!zgpc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5054bd63-df74-4057-8028-588acddd3f9f_370x302.png 848w, https://substackcdn.com/image/fetch/$s_!zgpc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5054bd63-df74-4057-8028-588acddd3f9f_370x302.png 1272w, https://substackcdn.com/image/fetch/$s_!zgpc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5054bd63-df74-4057-8028-588acddd3f9f_370x302.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zgpc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5054bd63-df74-4057-8028-588acddd3f9f_370x302.png" width="370" height="302" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5054bd63-df74-4057-8028-588acddd3f9f_370x302.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:302,&quot;width&quot;:370,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Understanding the Delta Lake Transaction Log - Databricks Blog&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Understanding the Delta Lake Transaction Log - Databricks Blog" title="Understanding the Delta Lake Transaction Log - Databricks Blog" srcset="https://substackcdn.com/image/fetch/$s_!zgpc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5054bd63-df74-4057-8028-588acddd3f9f_370x302.png 424w, https://substackcdn.com/image/fetch/$s_!zgpc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5054bd63-df74-4057-8028-588acddd3f9f_370x302.png 848w, https://substackcdn.com/image/fetch/$s_!zgpc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5054bd63-df74-4057-8028-588acddd3f9f_370x302.png 1272w, https://substackcdn.com/image/fetch/$s_!zgpc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5054bd63-df74-4057-8028-588acddd3f9f_370x302.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: https://delta.io</figcaption></figure></div><p><a href="https://delta.io/">Delta Lake</a> is an open-source storage layer that brings reliability and performance to data lakes. It allows data engineers to build scalable and robust pipelines by adding ACID transactions and schema enforcement to the flexibility of data lakes. </p><p>Delta Lake enables incremental data processing with time-travel capabilities, making it easy to track changes and roll back to previous versions of data. It also optimizes data lakes with features like compaction, z-order indexing, and liquid clustering, ensuring high performance for queries. 
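</p><p>To make time travel a little more concrete, here is a toy sketch in plain Python of the idea behind a transaction log: every commit records which data files were added or removed, and any earlier version can be rebuilt by replaying the commits up to that point. This is not the Delta Lake API; <code>ToyDeltaLog</code>, its methods, and the file names are hypothetical and exist only for illustration.</p><pre><code># Toy model of a table transaction log. NOT the Delta Lake API:
# ToyDeltaLog, commit(), snapshot() and the file names are hypothetical.


class ToyDeltaLog:
    """Append-only commit log; each commit lists data files added and removed."""

    def __init__(self):
        self.commits = []  # position in this list == table version

    def commit(self, add=(), remove=()):
        """Record one atomic commit and return its version number."""
        self.commits.append({"add": sorted(add), "remove": sorted(remove)})
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Replay commits 0..version to find the files live at that version."""
        if version is None:
            version = len(self.commits) - 1
        live = set()
        for entry in self.commits[: version + 1]:
            live.update(entry["add"])
            live.difference_update(entry["remove"])
        return sorted(live)


log = ToyDeltaLog()
v0 = log.commit(add=["part-000.parquet"])
v1 = log.commit(add=["part-001.parquet"])
# Compaction rewrites two small files into one larger file in a single commit.
v2 = log.commit(
    add=["part-002.parquet"],
    remove=["part-000.parquet", "part-001.parquet"],
)

print(log.snapshot())    # files that make up the latest version
print(log.snapshot(v1))  # "time travel": files that made up version 1
</code></pre><p>In a real Delta table the log is kept as files under the table&#8217;s <code>_delta_log</code> directory and the query engine replays it for you, so treat this only as a mental model of why older versions stay readable after a compaction.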
</p><p>As data lakes grow in size and complexity, Delta Lake is essential for ensuring consistency, reliability, and efficiency in managing large-scale datasets.</p><p>Through the <a href="https://delta.io/integrations/">Delta Kernel project</a> you can now use Delta Lake through a wide variety of programming languages and tools, such as Rust, Go, Python, Flink, Power BI, and many more, making it an even more versatile engine even in smaller datasets.</p><h3>Databricks</h3><p>Remember all the previous tools, frameworks, and languages we just covered? If you could take all of those and so much more, bundle it up into a single unified platform, that platform would be <a href="https://www.databricks.com/">Databricks</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nyc_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc400506-a208-42a7-862b-90b07b5d5eaf_916x854.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nyc_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc400506-a208-42a7-862b-90b07b5d5eaf_916x854.png 424w, https://substackcdn.com/image/fetch/$s_!nyc_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc400506-a208-42a7-862b-90b07b5d5eaf_916x854.png 848w, https://substackcdn.com/image/fetch/$s_!nyc_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc400506-a208-42a7-862b-90b07b5d5eaf_916x854.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nyc_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc400506-a208-42a7-862b-90b07b5d5eaf_916x854.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nyc_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc400506-a208-42a7-862b-90b07b5d5eaf_916x854.png" width="916" height="854" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc400506-a208-42a7-862b-90b07b5d5eaf_916x854.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:854,&quot;width&quot;:916,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;data inteligence engine &quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="data inteligence engine " title="data inteligence engine " srcset="https://substackcdn.com/image/fetch/$s_!nyc_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc400506-a208-42a7-862b-90b07b5d5eaf_916x854.png 424w, https://substackcdn.com/image/fetch/$s_!nyc_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc400506-a208-42a7-862b-90b07b5d5eaf_916x854.png 848w, https://substackcdn.com/image/fetch/$s_!nyc_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc400506-a208-42a7-862b-90b07b5d5eaf_916x854.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nyc_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc400506-a208-42a7-862b-90b07b5d5eaf_916x854.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: https://www.databricks.com/product/data-intelligence-platform</figcaption></figure></div><p>Databricks truly does unify the experience and solutions for data engineering, data analysis, data science, ML engineer / MLOps, and BI and reporting. 
Founded by the original creators of Apache Spark, Databricks earned its respect initially as a highly efficient proprietary runtime for Spark, and made notebooks and jobs simpler and more collaborative than ever. Since then, Databricks has continued to add new products to the platform, including several favorites such as:</p><ul><li><p>Notebooks with support for Python, Scala, SQL, and R.</p></li><li><p>Jobs / Workflows with advanced orchestration features.</p></li><li><p>SQL Dashboards.</p></li><li><p>Machine Learning model training, evaluation, logging, and model serving.</p></li><li><p>Gen AI development, evaluation, serving, and guardrails.</p></li><li><p>Data Governance through Unity Catalog with access control, row-level and column-level security, data lineage, and attribute-based access control (ABAC).</p></li><li><p>Serverless jobs, SQL, and model serving.</p></li></ul><p>Databricks is multi-cloud, supporting AWS and GCP, and is offered on Microsoft Azure through a partnership. Note: some features may be more widely available on AWS and Azure than on GCP.</p><div><hr></div><h2>Conclusion</h2><p>There are many more tools I couldn&#8217;t list without making this post too long; the tools above are some of my top picks, especially in 2024/2025. 
</p><p>Several other worthy mentions:</p><ul><li><p>Apache Airflow</p></li><li><p>Iceberg (similar to Delta Lake)</p></li><li><p>DLT</p></li><li><p>DuckDB</p></li><li><p>MLflow</p></li><li><p>Terraform</p></li><li><p>Kubernetes</p></li></ul><p>Thank you for reading, and if you liked this content, please consider subscribing and sharing the post with friends and colleagues!</p>]]></content:encoded></item><item><title><![CDATA[UniForm: Peace to Delta Lake + Iceberg + Hudi]]></title><description><![CDATA[In case you haven&#8217;t been paying attention the last few years, there is a war being waged&#8230; a war of which open table data format will come out on top: Delta Lake, Iceberg, and Hudi.]]></description><link>https://www.makewithdata.tech/p/uniform-peace-to-delta-lake-iceberg-hudi</link><guid isPermaLink="false">https://www.makewithdata.tech/p/uniform-peace-to-delta-lake-iceberg-hudi</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Mon, 24 Jun 2024 02:34:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1a1acf56-ef82-489e-b08a-28ccc1b79515_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DPjk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b0123d7-2fc5-44f0-9e54-8ba62299f95b_1024x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DPjk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b0123d7-2fc5-44f0-9e54-8ba62299f95b_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!DPjk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b0123d7-2fc5-44f0-9e54-8ba62299f95b_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!DPjk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b0123d7-2fc5-44f0-9e54-8ba62299f95b_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!DPjk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b0123d7-2fc5-44f0-9e54-8ba62299f95b_1024x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DPjk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b0123d7-2fc5-44f0-9e54-8ba62299f95b_1024x1024.webp" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b0123d7-2fc5-44f0-9e54-8ba62299f95b_1024x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:397328,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DPjk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b0123d7-2fc5-44f0-9e54-8ba62299f95b_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!DPjk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b0123d7-2fc5-44f0-9e54-8ba62299f95b_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!DPjk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b0123d7-2fc5-44f0-9e54-8ba62299f95b_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!DPjk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b0123d7-2fc5-44f0-9e54-8ba62299f95b_1024x1024.webp 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>In case you haven&#8217;t been paying attention the last few years, there is a war being waged&#8230; a war over which open table format will come out on top: Delta Lake, Iceberg, or Hudi. Each format offers a mix of shared and unique features for managing and scaling large data lakes, so the choice of format has been a hot topic in the data engineering community for quite some time.</p><p>Organizations looking to maximize the value and scalability of their data lakes are the ones that draw the short straw. No one wants to be torn between multiple tools, and no company wants to pick a format that becomes obsolete or turns out to be the wrong fit, forcing a complex migration. Fortunately, the table format wars may finally end in peace thanks to UniForm by Databricks.</p><h3>Background: Iceberg vs. 
Delta Lake</h3><p>Delta Lake and Iceberg have been the dominant formats in this field for a while, so we&#8217;ll focus on these:</p><h4>Delta Lake</h4><ul><li><p>Developed by Databricks, open-sourced in <a href="https://www.databricks.com/company/newsroom/press-releases/databricks-open-sources-delta-lake-for-data-lake-reliability#:~:text=April%2024%2C%202019&amp;text=San%20Francisco%20%E2%80%94%20April%2024%2C%202019,deliver%20reliability%20to%20data%20lakes.">April 2019</a>.</p></li><li><p>Offers ACID transactions for data integrity and consistency guarantees.</p></li><li><p>Maintains metadata separately from the data files, which use the Parquet format, to support scalable operations and versioning.</p></li><li><p>Versions data and enables time travel queries to view historical data easily.</p></li><li><p>Downloaded more than 20M times per month at the time of writing.</p></li><li><p><a href="https://delta.io/blog/delta-kernel/">Delta Kernel</a> is a set of libraries for the core Delta Lake logic, allowing users to operate on Delta Lake tables from almost any language or engine (Java, Python, C++, Rust, Spark, Trino, Pandas, Polars, DuckDB, etc.) without rewriting the core behaviors.</p></li><li><p>Project home page: <a href="https://delta.io/">https://delta.io/</a></p></li></ul><h4>Iceberg</h4><ul><li><p>Developed by Netflix, donated to the Apache Software Foundation in 2018.</p></li><li><p>Offers ACID transactions for data integrity and consistency guarantees.</p></li><li><p>Also maintains metadata and data files separately, enabling scalable operations on large tables.</p></li><li><p>Versions data and enables time travel to view historical data. Also borrows popular concepts from version control (e.g. Git) such as cherry-picking, branches/tags, and merges.</p></li><li><p>Supports a wide range of engines (Spark, Trino, Flink, Amazon Athena, etc.) 
and maintains these implementations within the main Iceberg repository, with a few exceptions such as Trino and Presto. See <a href="https://iceberg.apache.org/multi-engine-support/">https://iceberg.apache.org/multi-engine-support/</a></p></li><li><p>Project home page: <a href="https://iceberg.apache.org/">https://iceberg.apache.org/</a></p></li></ul><p>For years these formats have competed for dominance in the data lake ecosystem. Businesses often had to choose between them, betting on one format and the set of languages and engines deemed better than the rest (or at least lower effort, depending on the organization&#8217;s existing tech stack). This division created challenges in terms of interoperability, data governance, subject matter expertise, and overall data lake management.</p><h4>UniForm, our savior &#128588;</h4><p>Enter <a href="https://docs.databricks.com/en/delta/uniform.html">UniForm</a> (Universal Format), a feature offered by Databricks to bring peace. UniForm aims to make the Delta Lake, Apache Iceberg, and Apache Hudi formats interoperable by capitalizing on the fact that all three share the same fundamental trait: Parquet data files combined with format-specific metadata. Because the metadata is the only real difference, UniForm settles the dispute by letting you automatically and asynchronously generate the metadata for all three formats (Delta, Iceberg, Hudi), or for only the ones you enable if you care about just two of the three.</p><p>It&#8217;s simple: when creating a table in Databricks, you set a table property listing the formats you want metadata for:</p><pre><code>CREATE TABLE main.sales.skus (sku_name STRING, inserted_at TIMESTAMP)
TBLPROPERTIES('delta.universalFormat.enabledFormats' = 'iceberg,hudi');</code></pre><p>Now an engine that supports only Iceberg and not Delta Lake, for example, can access this table as if it were just an Iceberg table (and vice versa). You can also enable UniForm on existing tables; the property does not have to be set at creation time:</p><pre><code>ALTER TABLE table_name SET TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'iceberg');</code></pre><h4>Databricks acquires Tabular</h4><p>As a matter of fact, this is old news: Databricks began offering UniForm in 2023. So what&#8217;s all the buzz? Aren&#8217;t we at peace already? Well no, not exactly. These changes take time, from both the technology and the community perspective. From a community perspective, a new feature takes time to be recognized and adopted, even though it is incredibly easy to get started with. From a technology perspective, it is very challenging and time-consuming for a single project (Delta Lake) to reconcile all the ongoing changes and features from the other two formats, which in niche cases may create a parity gap where your faux Iceberg table does <strong>not</strong> integrate successfully.</p><p>On <a href="https://www.databricks.com/company/newsroom/press-releases/databricks-agrees-acquire-tabular-company-founded-original-creators">June 4, 2024</a>, Databricks announced it had agreed to acquire Tabular. Tabular is a data management company founded by Ryan Blue, Daniel Weeks, and Jason Reid, among whom are the original creators of Apache Iceberg at Netflix. According to Databricks in the press release:</p><blockquote><p>Databricks intends to work closely with the Delta Lake and Iceberg communities to bring format compatibility to the lakehouse; in the short term, inside Delta Lake UniForm and in the long term, by evolving toward a single, open, and common standard of interoperability. 
Databricks and Tabular will work together towards a joint vision of the open lakehouse.</p><p>Credit: <a href="https://www.databricks.com/company/newsroom/press-releases/databricks-agrees-acquire-tabular-company-founded-original-creators">Databricks Press Release</a></p></blockquote><p>Furthermore, Databricks CEO Ali Ghodsi, along with Ryan Blue, shared at the 2024 Data + AI Summit keynote that this strategic decision is intended specifically to combine the two teams&#8217; technical expertise in the two leading formats, Delta Lake and Iceberg, so we can expect development within UniForm to mature and UniForm itself to become increasingly ubiquitous.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LB9-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8984d486-a4b6-41e8-a560-f4477e20a94c_1024x596.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LB9-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8984d486-a4b6-41e8-a560-f4477e20a94c_1024x596.png 424w, https://substackcdn.com/image/fetch/$s_!LB9-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8984d486-a4b6-41e8-a560-f4477e20a94c_1024x596.png 848w, https://substackcdn.com/image/fetch/$s_!LB9-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8984d486-a4b6-41e8-a560-f4477e20a94c_1024x596.png 1272w, https://substackcdn.com/image/fetch/$s_!LB9-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8984d486-a4b6-41e8-a560-f4477e20a94c_1024x596.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LB9-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8984d486-a4b6-41e8-a560-f4477e20a94c_1024x596.png" width="1024" height="596" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8984d486-a4b6-41e8-a560-f4477e20a94c_1024x596.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:596,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LB9-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8984d486-a4b6-41e8-a560-f4477e20a94c_1024x596.png 424w, https://substackcdn.com/image/fetch/$s_!LB9-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8984d486-a4b6-41e8-a560-f4477e20a94c_1024x596.png 848w, https://substackcdn.com/image/fetch/$s_!LB9-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8984d486-a4b6-41e8-a560-f4477e20a94c_1024x596.png 1272w, https://substackcdn.com/image/fetch/$s_!LB9-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8984d486-a4b6-41e8-a560-f4477e20a94c_1024x596.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><a href="https://youtu.be/uB0n4IZmS34?feature=shared&amp;t=1380">Data + AI Summit 2024 Keynote</a> talk by Databricks CEO Ali Ghodsi (left) and Tabular&#8217;s Ryan Blue (right).</figcaption></figure></div><p>For this, these data technology giants have my respect, as I believe the long-term value lies in openness and maximum flexibility for all. 
What a breath of fresh air after countless hours of the normal one-sided marketing pitches that lead to vendor lock-in.</p><p></p>]]></content:encoded></item><item><title><![CDATA[🍷FineWeb: the new Pile 🤔]]></title><description><![CDATA[FineWeb is a large-scale web corpus created by Hugging Face to train state-of-the-art LLMs but how does it compare to ThePile?]]></description><link>https://www.makewithdata.tech/p/fineweb-the-new-pile</link><guid isPermaLink="false">https://www.makewithdata.tech/p/fineweb-the-new-pile</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Thu, 02 May 2024 03:20:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550242f3-3d22-46b9-a1b5-b38f283592c0_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j-kT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550242f3-3d22-46b9-a1b5-b38f283592c0_1024x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j-kT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550242f3-3d22-46b9-a1b5-b38f283592c0_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!j-kT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550242f3-3d22-46b9-a1b5-b38f283592c0_1024x1024.webp 848w, 
https://substackcdn.com/image/fetch/$s_!j-kT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550242f3-3d22-46b9-a1b5-b38f283592c0_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!j-kT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550242f3-3d22-46b9-a1b5-b38f283592c0_1024x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j-kT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550242f3-3d22-46b9-a1b5-b38f283592c0_1024x1024.webp" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/550242f3-3d22-46b9-a1b5-b38f283592c0_1024x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149454,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j-kT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550242f3-3d22-46b9-a1b5-b38f283592c0_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!j-kT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550242f3-3d22-46b9-a1b5-b38f283592c0_1024x1024.webp 848w, 
https://substackcdn.com/image/fetch/$s_!j-kT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550242f3-3d22-46b9-a1b5-b38f283592c0_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!j-kT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550242f3-3d22-46b9-a1b5-b38f283592c0_1024x1024.webp 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>What is FineWeb?</h2><p><a 
href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">FineWeb</a> is a large-scale web corpus created by Hugging Face for training state-of-the-art Large Language Models (LLMs). FineWeb is massive: it contains over 15 trillion tokens of cleaned and deduplicated English web data from the <a href="https://commoncrawl.org/">CommonCrawl</a> project.</p><p>According to Hugging Face, their dataset &#8220;was originally meant to be a fully open replication of <a href="https://huggingface.co/papers/2306.01116">RefinedWeb</a>, with a release of the <strong>full dataset</strong> under the <strong>ODC-By 1.0 license</strong>. However, by carefully adding additional filtering steps, [they] managed to push the performance of FineWeb well above that of the original RefinedWeb, and models trained on our dataset also outperform models trained on other commonly used high quality web datasets (like C4, Dolma-v1.6, The Pile, SlimPajama) on our aggregate group of benchmark tasks.&#8221;<a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">[1]</a></p><h2>FineWeb vs. RefinedWeb vs. 
The Pile</h2><ul><li><p><strong>Size and Scale</strong>: FineWeb contains over <strong>15 trillion tokens</strong> and is a 45 TB download, whereas RefinedWeb contains only about 5 trillion tokens (1.68 TB) and The Pile about 1.35 trillion tokens (886 GB).</p></li><li><p><strong>Data Cleaning: </strong>The team at Hugging Face prioritized high-quality web pages, using advanced filtering algorithms to remove spam, duplicate data, and low-quality pages.</p><ul><li><p><a href="https://github.com/huggingface/datatrove/blob/main/src/datatrove/pipeline/formatters/pii.py">PII Formatting</a> is also used to anonymize email addresses and public IP addresses scraped from web pages.</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7Jmz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96803845-bd0a-4785-8f9f-6495dea8e782_1024x698.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7Jmz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96803845-bd0a-4785-8f9f-6495dea8e782_1024x698.png 424w, https://substackcdn.com/image/fetch/$s_!7Jmz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96803845-bd0a-4785-8f9f-6495dea8e782_1024x698.png 848w, https://substackcdn.com/image/fetch/$s_!7Jmz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96803845-bd0a-4785-8f9f-6495dea8e782_1024x698.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7Jmz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96803845-bd0a-4785-8f9f-6495dea8e782_1024x698.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7Jmz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96803845-bd0a-4785-8f9f-6495dea8e782_1024x698.png" width="1024" height="698" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96803845-bd0a-4785-8f9f-6495dea8e782_1024x698.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:698,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Line graph of FineWeb benchmarked against other web corpus datasets like C4, Dolma, RefinedWeb, SlimPajama, and The Pile.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Line graph of FineWeb benchmarked against other web corpus datasets like C4, Dolma, RefinedWeb, SlimPajama, and The Pile." title="Line graph of FineWeb benchmarked against other web corpus datasets like C4, Dolma, RefinedWeb, SlimPajama, and The Pile." 
srcset="https://substackcdn.com/image/fetch/$s_!7Jmz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96803845-bd0a-4785-8f9f-6495dea8e782_1024x698.png 424w, https://substackcdn.com/image/fetch/$s_!7Jmz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96803845-bd0a-4785-8f9f-6495dea8e782_1024x698.png 848w, https://substackcdn.com/image/fetch/$s_!7Jmz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96803845-bd0a-4785-8f9f-6495dea8e782_1024x698.png 1272w, https://substackcdn.com/image/fetch/$s_!7Jmz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96803845-bd0a-4785-8f9f-6495dea8e782_1024x698.png 1456w" sizes="100vw"></picture></div></a>
<figcaption class="image-caption">Source: https://huggingface.co/datasets/HuggingFaceFW/fineweb</figcaption></figure></div><h3>Data Sources</h3><p>So the data comes from the internet, but how exactly did Hugging Face obtain it?</p><ul><li><p><strong>Dynamic Web Scraping: </strong>FineWeb relies on more sophisticated web crawling and filtering techniques that prioritize higher-quality content, such as authoritative sources and recent content, and filter out harmful content.</p></li><li><p><strong>Expert Vetting: </strong>The dataset was processed with the <a href="https://github.com/huggingface/datatrove/">datatrove</a> library and went through a process of expert vetting and analysis to determine the optimal deduplication approach, rather than applying a single deduplication method across the entire dataset.</p></li><li><p><strong>Community Feedback Loop: </strong>The creators at Hugging Face have indicated they intend to iteratively improve the dataset over time through surveys, polls, open-ended discussions, and direct interactions.</p></li></ul><h3>What does FineWeb contain?</h3><p>Let&#8217;s take a look at the general makeup of the FineWeb dataset.</p><ul><li><p>The dataset is based on internet crawls between the summer of 2013 and the winter of 2024.<a href="https://dev.to/maximsaplin/fineweb-45tb-dataset-500k-gpu-costs-and-adult-content-improving-llm-quality-521g">[2]</a></p></li><li><p>It was created by processing and distilling 38,000 TB of CommonCrawl dumps into a 45 TB dataset ready for language model training. 
<a href="https://dev.to/maximsaplin/fineweb-45tb-dataset-500k-gpu-costs-and-adult-content-improving-llm-quality-521g">[2]</a></p></li><li><p>The dataset offers multiple configurations, so you can load the default or a more specific dump/crawl such as CC-MAIN-2023-50. See <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb#breakdown-by-dumpcrawl">here</a> for the table of dumps/crawls you can selectively load.</p></li><li><p>Contains a variety of text from academic papers, books, news articles, blogs, forums, and more, aiming for a holistic representation of language use.</p><ul><li><p>Also includes structured data types like tables and lists to train models on data extraction and interpretation tasks.</p></li></ul></li></ul><h3>Use Cases &amp; Applications</h3><p>One of the things I love about FineWeb is the care taken to anonymize Personally Identifiable Information (PII) and to avoid toxic or biased content. We can certainly debate whether this is the dataset&#8217;s responsibility to solve. I like knowing this data has received that treatment, and ultimately it gives the community another fantastic corpus to train with in NLP and LLM applications.</p><p>FineWeb covers a very wide range of topics and styles at high quality. One use case we may begin to see is using FineWeb to improve benchmarking and evaluation of language models, adding more comprehensive representation to existing benchmarks.</p><p>If you are considering FineWeb for coding tasks, please note the known limitation shared in the README:</p><blockquote><p>As a consequence of some of the filtering steps applied, it is likely that code content is not prevalent in our dataset. 
If you are training a model that should also perform code tasks, we recommend you use FineWeb with a code dataset, such as&nbsp;<a href="https://huggingface.co/datasets/bigcode/the-stack-v2">The Stack v2</a>.</p><p><a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb#other-known-limitations">[3] Source</a></p></blockquote><p>I&#8217;m excited to see what new use cases and advancements this will bring to the GenAI community; in fact, dozens of open-source models trained on FineWeb have already been registered: https://huggingface.co/models?dataset=dataset:HuggingFaceFW/fineweb</p><p>The post <a href="https://makewithdata.tech/%f0%9f%8d%b7fineweb-the-new-pile-%f0%9f%a4%94/">&#127863;FineWeb: the new Pile &#129300;</a> appeared first on <a href="https://makewithdata.tech">MakeWithData</a>.</p>]]></content:encoded></item><item><title><![CDATA[From Beginner to Certified Azure Administrator in Two Weeks]]></title><description><![CDATA[Whether you&#8217;re a cloud veteran on Azure, AWS, GCP, or just getting started on your cloud computing and solution architecture journey, official certifications are still a great way to learn and improve your skill set.]]></description><link>https://www.makewithdata.tech/p/azure-admin-certified-in-two-weeks</link><guid isPermaLink="false">https://www.makewithdata.tech/p/azure-admin-certified-in-two-weeks</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Wed, 10 Apr 2024 02:14:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0b37bd-ceff-47a2-8f62-b52d3aa5a12b_720x720.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!4WKL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0b37bd-ceff-47a2-8f62-b52d3aa5a12b_720x720.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4WKL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0b37bd-ceff-47a2-8f62-b52d3aa5a12b_720x720.webp 424w, https://substackcdn.com/image/fetch/$s_!4WKL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0b37bd-ceff-47a2-8f62-b52d3aa5a12b_720x720.webp 848w, https://substackcdn.com/image/fetch/$s_!4WKL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0b37bd-ceff-47a2-8f62-b52d3aa5a12b_720x720.webp 1272w, https://substackcdn.com/image/fetch/$s_!4WKL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0b37bd-ceff-47a2-8f62-b52d3aa5a12b_720x720.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4WKL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0b37bd-ceff-47a2-8f62-b52d3aa5a12b_720x720.webp" width="720" height="720" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c0b37bd-ceff-47a2-8f62-b52d3aa5a12b_720x720.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:720,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90494,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4WKL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0b37bd-ceff-47a2-8f62-b52d3aa5a12b_720x720.webp 424w, https://substackcdn.com/image/fetch/$s_!4WKL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0b37bd-ceff-47a2-8f62-b52d3aa5a12b_720x720.webp 848w, https://substackcdn.com/image/fetch/$s_!4WKL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0b37bd-ceff-47a2-8f62-b52d3aa5a12b_720x720.webp 1272w, https://substackcdn.com/image/fetch/$s_!4WKL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0b37bd-ceff-47a2-8f62-b52d3aa5a12b_720x720.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Whether you&#8217;re a cloud veteran on Azure, AWS, GCP, or just getting started on your cloud computing and solution architecture journey, official certifications are still a great way to learn and improve your skill set.</p><p>I actually often hear questions like &#8220;are certifications even still useful or required to get a job?&#8221; Honestly, your mileage will vary just like it does in today&#8217;s debate about college degrees being a prerequisite to technical jobs &#8212; it all depends on the employer and the individual. 
If you find yourself turned down purely for lacking one of these artifacts <strong>alone</strong>, that&#8217;s probably a red flag that you&#8217;re in the wrong place to begin with (portfolios and prior experience are often just as important, if not more so).</p><p>However, I&#8217;m a firm believer that certifications are an excellent form of curated learning and development no matter what your opinion of standardized testing is.</p><p>I recently earned my first Azure certification, Azure Administrator Associate (AZ-104), after <strong>less than 2 weeks</strong> of preparation. I have 7 years of experience with AWS, with a focus on data engineering, DevOps, and solution architecture; I wanted to broaden my overall cloud expertise by learning Azure, the second-largest cloud provider behind AWS. My goal: to learn what a Solution Architect on Azure needs to know and earn the <a href="https://learn.microsoft.com/en-us/credentials/certifications/exams/az-305/">AZ-305</a> <em>Azure Solutions Architect Expert credential</em>. Before reaching the AZ-305, though, Microsoft requires a prerequisite exam, the <a href="https://learn.microsoft.com/en-us/credentials/certifications/exams/az-104/">AZ-104</a>: <em>Azure Administrator Associate</em>, so that&#8217;s where my story begins.</p><h2>Study Materials and Resources</h2><p>I didn&#8217;t exactly <em>plan</em> to take the exam with only two weeks of preparation, but I&#8217;m comfortable taking tests, have a decent memory, and, let&#8217;s face it, am pretty impatient. 
So what did I study in those 14 days?</p><p>Mostly two things:</p><ol><li><p><a href="https://www.youtube.com/watch?v=0Knf9nub4-k">John Savill&#8217;s study cram video for AZ-104</a></p></li><li><p>The Microsoft Learn learning paths listed on the <a href="https://learn.microsoft.com/en-us/credentials/certifications/exams/az-104/">exam page</a></p></li></ol><p>I also took the practice assessment on the AZ-104 exam page linked above. I did this once before doing ANY studying, to get a baseline of what I could answer from context clues and similar knowledge from AWS. I took the practice assessment a few more times after all my studying, until I scored 98% on it and felt confident.</p><blockquote><p>Note: the practice assessment on Microsoft Learn repeats many of the same questions after 2 or 3 attempts, so keep this in mind.</p></blockquote><h2>Study Cram Video</h2><p>John Savill is a very well-known Azure instructor, and you&#8217;ll quickly see his name pop up in search results and Reddit threads when looking for study material. What I love is his study cram, a VERY long 4-hour video giving a high-level overview of every Azure service and topic in the exam. Unfortunately for me, he released a V2 of the AZ-104 video just a few days before my exam was scheduled, but that is a great testament to his commitment to staying up to date.</p><div id="youtube2-0Knf9nub4-k" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;0Knf9nub4-k&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/0Knf9nub4-k?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><blockquote><p>Note: if you&#8217;re looking for a similar instructor for AWS, check out Stephane Maarek on Udemy. 
He is excellent!</p></blockquote><p>Remember when I said I was impatient? Well, when watching this 4-hour video, I set the speed to 2x and enabled closed captions. I often do this with learning videos simply because reading is faster than listening, and I find that instructional videos are generally spoken more slowly anyway. You may prefer a slower speed like 1.5x or even a faster speed, but I highly recommend this if you&#8217;re able to digest and retain the information.</p><h2>Microsoft Learn</h2><p>John&#8217;s videos are great, but there&#8217;s no way to cover hands-on experience in a format like that (it&#8217;d be longer than even 4 hours&#8230;), so I also tried to complete all the Microsoft Learn paths listed right there on the AZ-104 exam page. Plus, c&#8217;mon, it&#8217;d be a discourtesy to ignore all that free material from the very platform that you&#8217;re learning about.</p><p>As of this writing, there are 5 Learning Paths listed with this exam, covering topics like compute resources, virtual networks, storage, and identities and governance. Each Learning Path has several Modules, and each Module contains several Units. I&#8217;m not sure how Microsoft estimates this, but when viewing a Learning Path you&#8217;ll see an estimated time, usually several hours for a single Learning Path. Personally, I was able to complete these much more quickly, usually in about a couple of hours per Learning Path. Most of the content is sheer reading, followed by some hands-on labs and finally a very short knowledge check (multiple-choice questions).</p><p>Some contain simulated Labs, which are just a video-like experience that you can click through without spinning up resources in your subscription; these are great because they let you see the Azure portal UI as you walk through a scenario while still clicking through at your own pace. 
Real hands-on labs require a Sandbox subscription, of which you can get up to 10 per day. These are the real deal, as you literally spin up resources in Azure (don&#8217;t worry, the Sandbox costs you nothing; just stick to the script for what you&#8217;re supposed to deploy). They take the longest, though, so be practical and complete them only for topics where you really feel you need the extra experience.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fUw9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff033a168-bb8f-428b-9f6a-20f5f8c15dbc_1826x1568.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fUw9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff033a168-bb8f-428b-9f6a-20f5f8c15dbc_1826x1568.png 424w, https://substackcdn.com/image/fetch/$s_!fUw9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff033a168-bb8f-428b-9f6a-20f5f8c15dbc_1826x1568.png 848w, https://substackcdn.com/image/fetch/$s_!fUw9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff033a168-bb8f-428b-9f6a-20f5f8c15dbc_1826x1568.png 1272w, https://substackcdn.com/image/fetch/$s_!fUw9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff033a168-bb8f-428b-9f6a-20f5f8c15dbc_1826x1568.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fUw9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff033a168-bb8f-428b-9f6a-20f5f8c15dbc_1826x1568.png" width="1826" height="1568" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f033a168-bb8f-428b-9f6a-20f5f8c15dbc_1826x1568.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1568,&quot;width&quot;:1826,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!fUw9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff033a168-bb8f-428b-9f6a-20f5f8c15dbc_1826x1568.png 424w, https://substackcdn.com/image/fetch/$s_!fUw9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff033a168-bb8f-428b-9f6a-20f5f8c15dbc_1826x1568.png 848w, https://substackcdn.com/image/fetch/$s_!fUw9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff033a168-bb8f-428b-9f6a-20f5f8c15dbc_1826x1568.png 1272w, https://substackcdn.com/image/fetch/$s_!fUw9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff033a168-bb8f-428b-9f6a-20f5f8c15dbc_1826x1568.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This may sound silly, but my greatest challenge was honestly just staying awake while working through the Microsoft Learn content. Reading instantly makes me sleepy, and I need to re-read paragraphs a few times when my attention drifts. I usually worked through about 1 Learning Path per day, though some took multiple days, and I usually studied right before bed so that the inevitable drowsiness from reading wouldn&#8217;t disrupt my sleep schedule too much.</p><p>I want to emphasize that my rapid-pace method is what works for me, and you should NOT try to &#8220;game the system&#8221; by taking these exams simply as early as you can. 
The point is not just to tack on another badge to your resume or profile, and nothing can replace real hands-on experience. So in addition to the study methods above, I also spent a little time on hands-on work in my own Azure subscription: creating storage accounts, trying out Azure Files, creating an Azure Functions App that automatically cleans up Resource Groups based on a tag, using RBAC to grant that Function App the permissions it needs to do so, etc.</p><h2>Exam Experience</h2><p>I scheduled my exam for very early on a Friday morning, at 7am (in hindsight, this was a terrible idea). Honestly, I wanted to get it over with before the weekend, and early enough that it didn&#8217;t interfere with getting to work that day. I took the exam online at home, proctored by Pearson VUE (Microsoft&#8217;s choice of testing provider at the time of this writing).</p><p>The night before, I made sure to prepare the room where I&#8217;d be taking the exam: clearing off my desk, covering up any bookshelves, removing any pictures with words on them, etc. This may sound like overkill, but in my testing experience it really depends on the proctor. Make sure to read the exam guidelines and room requirements before taking your tests.</p><blockquote><p><strong>Pro-tip</strong>: you can begin your check-in to Pearson 30 minutes before your appointed exam time, and you should <strong>always</strong> take advantage of that. It almost always comes in handy for dealing with technical difficulties or minor issues during the check-in process, such as forgetting to unplug your monitors. Also, make sure to close every other application beforehand so nothing interferes with the testing software or gets flagged.</p></blockquote><p>For obvious reasons I will not be sharing details about any questions from the exam. However, once I got into my exam, it let me know how many questions I would have and their types (e.g. multiple choice, case study, etc.) 
just as Microsoft says on the Learn website. I went through each question carefully. On questions I was unsure of, I clicked the checkbox to mark them for review so I could come back to them at the end if there was time.</p><p>At the end, the exam gives you the chance to go back and review the questions you marked, so I used my extra time to double back on some. Note: Microsoft now allows you to access the Microsoft Learn website during the exam so you can look up API references and documentation, which I found really useful for questions about specific things you&#8217;d have trouble memorizing.</p><p>My only piece of feedback to Microsoft about the exam is the false sense of &#8220;completion&#8221; I got when I answered all the questions from the first &#8220;section.&#8221; Your exam might have different sections, like a multiple-choice section followed by a case study, for example. In my case, I was faced with the screen showing me the questions I could go back and review, so I thought I was at the <strong>end</strong>&#8230; it was only when I was about to submit everything that I clicked &#8220;Next&#8221; and realized there was a whole other section waiting for me &#129318;&#8205;&#9794;&#65039; I got lucky and scored high enough on the rest of the exam that this didn&#8217;t matter, but please learn from my mistake! And hopefully Microsoft, or Pearson, makes the distinction between the end of a section and the end of the exam clearer in the future.</p><h2>Conclusion</h2><p>Everyone learns in their own way and at their own pace. I hope my story is helpful to you and inspires you to take that next step in your learning and development journey.</p><p>Have you received an Azure or AWS certification recently? 
Comment below what your experience was like, or if you have questions about what to learn next.</p><p></p>]]></content:encoded></item><item><title><![CDATA[AWS Costs Saving Checklist: Quick and Easy Ways to Reduce Your Bills]]></title><description><![CDATA[Review this simple checklist to quickly optimize your AWS cloud costs. No matter how big or small the organization, you can find savings here.]]></description><link>https://www.makewithdata.tech/p/aws-costs-saving-checklist-quick-and-easy-ways-to-reduce-your-bills</link><guid isPermaLink="false">https://www.makewithdata.tech/p/aws-costs-saving-checklist-quick-and-easy-ways-to-reduce-your-bills</guid><dc:creator><![CDATA[Zach King]]></dc:creator><pubDate>Tue, 02 Apr 2024 01:40:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72e89681-bdfd-4807-84e8-db847b9c183a_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!thUk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72e89681-bdfd-4807-84e8-db847b9c183a_1024x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!thUk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72e89681-bdfd-4807-84e8-db847b9c183a_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!thUk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72e89681-bdfd-4807-84e8-db847b9c183a_1024x1024.webp 848w, 
https://substackcdn.com/image/fetch/$s_!thUk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72e89681-bdfd-4807-84e8-db847b9c183a_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!thUk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72e89681-bdfd-4807-84e8-db847b9c183a_1024x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!thUk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72e89681-bdfd-4807-84e8-db847b9c183a_1024x1024.webp" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72e89681-bdfd-4807-84e8-db847b9c183a_1024x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:165488,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!thUk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72e89681-bdfd-4807-84e8-db847b9c183a_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!thUk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72e89681-bdfd-4807-84e8-db847b9c183a_1024x1024.webp 848w, 
https://substackcdn.com/image/fetch/$s_!thUk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72e89681-bdfd-4807-84e8-db847b9c183a_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!thUk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72e89681-bdfd-4807-84e8-db847b9c183a_1024x1024.webp 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>It&#8217;s 2024 and every tech company wants to <em>do more with less</em>, and that means reducing AWS costs. 
Review this simple checklist to find quick wins that cut costs on your next AWS bill.</p><p>Also, with industries everywhere taking up <a href="https://www.databricks.com/company/newsroom/press-releases/databricks-introduces-new-generative-ai-tools-investing-lakehouse">GenAI</a> initiatives to find the next way to add value and distinguish themselves in their market, the total cost of ownership for cloud infrastructure is becoming even more important.</p><h2>1. Use gp3 EBS Volumes vs gp2</h2><p>If you use EC2 instances at all, you&#8217;re almost certainly using EBS volumes as well, even if only for the root (<code>/</code>) volume. An often-overlooked saving is migrating from gp2 to gp3 EBS volumes, thanks to gp3&#8217;s pricing structure and performance flexibility.</p><p>Here&#8217;s why you should switch to gp3 if you haven&#8217;t already:</p><ul><li><p>Lower cost per GB. <a href="https://aws.amazon.com/blogs/storage/migrate-your-amazon-ebs-volumes-from-gp2-to-gp3-and-save-up-to-20-on-costs/">Up to 20% lower compared to gp2</a>.</p></li><li><p>Generally more performant, with a baseline of 3,000 IOPS and 125 MiB/s regardless of volume size.</p></li><li><p>Flexibility to scale IOPS (Input/Output Operations Per Second) and throughput independently of storage capacity. This makes scaling up to meet disk performance demands more cost-effective than on gp2, where getting more performance requires provisioning larger volumes.</p></li></ul><p>Don&#8217;t take my word for it; check out Amazon&#8217;s blog on gp2 vs. gp3 volumes here: <a href="https://aws.amazon.com/blogs/storage/migrate-your-amazon-ebs-volumes-from-gp2-to-gp3-and-save-up-to-20-on-costs/">https://aws.amazon.com/blogs/storage/migrate-your-amazon-ebs-volumes-from-gp2-to-gp3-and-save-up-to-20-on-costs/</a></p><h2>2. Use VPC Endpoint for AWS S3 Costs</h2><p>Do you know how your requests are reaching Amazon&#8217;s APIs when using S3? 
Ever wonder why your NAT costs are so high each month?</p><p>By default, network traffic to access Amazon S3 from a VPC is routed over the public Internet. This typically uses a NAT (Network Address Translation) gateway, which is charged per GB processed. Your VPC is in AWS, and so is S3, so wouldn&#8217;t it be nice if we could keep that traffic on AWS&#8217;s backbone&#8230; yes!</p><p>You can create a VPC Endpoint for S3, which requires zero changes to your applications or workloads and ensures S3 requests are routed directly from your VPC to S3. This direct connection can often reduce your NAT costs by 50&#8211;70%, depending on data volume and which region you use, and it also improves network performance.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3e7M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e4fce2-b9eb-4f0a-94a3-48c9df670332_1024x238.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3e7M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e4fce2-b9eb-4f0a-94a3-48c9df670332_1024x238.png 424w, https://substackcdn.com/image/fetch/$s_!3e7M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e4fce2-b9eb-4f0a-94a3-48c9df670332_1024x238.png 848w, https://substackcdn.com/image/fetch/$s_!3e7M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e4fce2-b9eb-4f0a-94a3-48c9df670332_1024x238.png 1272w, 
https://substackcdn.com/image/fetch/$s_!3e7M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e4fce2-b9eb-4f0a-94a3-48c9df670332_1024x238.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3e7M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e4fce2-b9eb-4f0a-94a3-48c9df670332_1024x238.png" width="1024" height="238" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64e4fce2-b9eb-4f0a-94a3-48c9df670332_1024x238.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:238,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Credit: https://aws.amazon.com/blogs/architecture/reduce-cost-and-increase-security-with-amazon-vpc-endpoints/&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Credit: https://aws.amazon.com/blogs/architecture/reduce-cost-and-increase-security-with-amazon-vpc-endpoints/" title="Credit: https://aws.amazon.com/blogs/architecture/reduce-cost-and-increase-security-with-amazon-vpc-endpoints/" srcset="https://substackcdn.com/image/fetch/$s_!3e7M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e4fce2-b9eb-4f0a-94a3-48c9df670332_1024x238.png 424w, https://substackcdn.com/image/fetch/$s_!3e7M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e4fce2-b9eb-4f0a-94a3-48c9df670332_1024x238.png 848w, 
https://substackcdn.com/image/fetch/$s_!3e7M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e4fce2-b9eb-4f0a-94a3-48c9df670332_1024x238.png 1272w, https://substackcdn.com/image/fetch/$s_!3e7M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e4fce2-b9eb-4f0a-94a3-48c9df670332_1024x238.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Credit: <a href="https://aws.amazon.com/blogs/architecture/reduce-cost-and-increase-security-with-amazon-vpc-endpoints/">https://aws.amazon.com/blogs/architecture/reduce-cost-and-increase-security-with-amazon-vpc-endpoints/</a></figcaption></figure></div><p>Again, don&#8217;t take my word for it; read Amazon&#8217;s blog and instructions here: <a href="https://aws.amazon.com/blogs/architecture/reduce-cost-and-increase-security-with-amazon-vpc-endpoints/">https://aws.amazon.com/blogs/architecture/reduce-cost-and-increase-security-with-amazon-vpc-endpoints/</a></p><blockquote><p><strong>Note:</strong> you can create endpoints for many other AWS services too, such as DynamoDB, Kinesis, and Lambda, but S3 is usually the most beneficial for data-intensive organizations. Please also note the difference between &#8220;Gateway&#8221; and &#8220;Interface&#8221; endpoint types: Gateway endpoints are available only for S3 and DynamoDB and carry no additional charge, while Interface endpoints are billed per hour and per GB processed.</p></blockquote><h2>3. Watch out for S3 object versioning</h2><p>Accidents happen. We&#8217;ve all been there when something gets accidentally deleted and panic ensues. 
That&#8217;s why we create backups and build BCDR (Business Continuity / Disaster Recovery) plans.</p><p>If you use S3, you may have given yourself added protection against accidental deletion by enabling <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html">S3 Versioning</a>, which keeps every version of every object in a bucket and lets you recover a previous version.</p><p>However, the risk you run with this feature is letting versions accumulate out of control, especially if applications overwrite the same objects frequently, as in data lakes and Lakehouse architectures. The extra versions of your objects are referred to as &#8220;noncurrent&#8221; versions because they&#8217;re not the active copy of the object, and each one is billed for storage like any other object. Add a lifecycle rule to your buckets to ensure &#8220;noncurrent&#8221; versions expire after <strong>X</strong> days; you can also write more complex lifecycle rules if you need to retain at least 1 noncurrent copy.</p><p>See the user guide for more information about deleting object versions: <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeletingObjectVersions.html">https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeletingObjectVersions.html</a></p><h2>4. S3 Lifecycle, Storage Classes, and Analytics</h2><p>Amazon provides a few great ways to analyze the volume and trends of your S3 buckets under <a href="https://aws.amazon.com/s3/storage-analytics-insights/">Amazon S3 Storage Analytics Insights</a>, which includes the <em>S3 Storage Lens</em> and <em>S3 Inventory</em> features.</p><p><a href="https://aws.amazon.com/s3/storage-lens/">S3 Storage Lens</a> is useful for identifying your largest buckets by object count, physical size, and noncurrent versions, and for seeing how these metrics trend over time. For example, when you implement #3 from this checklist, use S3 Storage Lens to view the percentage of your S3 storage occupied by noncurrent versions. 
When you implement lifecycle rules, Lens is helpful to validate those changes (in addition to seeing &#129297; in Cost Explorer).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L16x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc441aaf5-9605-4718-aec4-dd65cec7a032_1181x421.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L16x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc441aaf5-9605-4718-aec4-dd65cec7a032_1181x421.png 424w, https://substackcdn.com/image/fetch/$s_!L16x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc441aaf5-9605-4718-aec4-dd65cec7a032_1181x421.png 848w, https://substackcdn.com/image/fetch/$s_!L16x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc441aaf5-9605-4718-aec4-dd65cec7a032_1181x421.png 1272w, https://substackcdn.com/image/fetch/$s_!L16x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc441aaf5-9605-4718-aec4-dd65cec7a032_1181x421.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L16x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc441aaf5-9605-4718-aec4-dd65cec7a032_1181x421.png" width="1181" height="421" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c441aaf5-9605-4718-aec4-dd65cec7a032_1181x421.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:421,&quot;width&quot;:1181,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Credit: https://aws.amazon.com/s3/storage-lens/&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Credit: https://aws.amazon.com/s3/storage-lens/" title="Credit: https://aws.amazon.com/s3/storage-lens/" srcset="https://substackcdn.com/image/fetch/$s_!L16x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc441aaf5-9605-4718-aec4-dd65cec7a032_1181x421.png 424w, https://substackcdn.com/image/fetch/$s_!L16x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc441aaf5-9605-4718-aec4-dd65cec7a032_1181x421.png 848w, https://substackcdn.com/image/fetch/$s_!L16x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc441aaf5-9605-4718-aec4-dd65cec7a032_1181x421.png 1272w, https://substackcdn.com/image/fetch/$s_!L16x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc441aaf5-9605-4718-aec4-dd65cec7a032_1181x421.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Credit: <a href="https://aws.amazon.com/s3/storage-lens/">https://aws.amazon.com/s3/storage-lens/</a></figcaption></figure></div><p>Another useful way to look at your data in Lens is by storage class. Beyond the default S3 Standard, there are other storage classes that lifecycle rules can move objects into to get the best price-to-performance. For infrequently accessed data, S3 Standard-Infrequent Access (S3 Standard-IA) and S3 One Zone-IA are suitable classes, with One Zone-IA (which stores data in a single Availability Zone) being ideal when high availability isn&#8217;t necessary, for a cost reduction of up to 40%. 
For cold data that is accessed very rarely, like backups and archives, use one of the S3 Glacier storage classes, which can save up to 95% on storage costs depending on data volume and the specific Glacier tier.</p><p>If your S3 access patterns are predictable and designed with enough separation of buckets and/or path prefixes, you can save tons of money by defining lifecycle rules on buckets to transition objects from hotter storage classes (e.g. S3 Standard) to cooler storage classes when appropriate.</p><p>Alternatively, if your access patterns are less predictable or changing, you may benefit from <a href="https://aws.amazon.com/s3/storage-classes/intelligent-tiering/">S3 Intelligent-Tiering</a>. With this storage class, you pay a small monthly per-object monitoring and automation charge in exchange for Amazon automatically monitoring access patterns and moving objects to a lower-cost access tier when they haven&#8217;t been accessed for a while.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SZbd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53295717-4920-4e8e-818e-f57e8babb973_2360x970.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SZbd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53295717-4920-4e8e-818e-f57e8babb973_2360x970.png 424w, https://substackcdn.com/image/fetch/$s_!SZbd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53295717-4920-4e8e-818e-f57e8babb973_2360x970.png 848w, 
https://substackcdn.com/image/fetch/$s_!SZbd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53295717-4920-4e8e-818e-f57e8babb973_2360x970.png 1272w, https://substackcdn.com/image/fetch/$s_!SZbd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53295717-4920-4e8e-818e-f57e8babb973_2360x970.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SZbd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53295717-4920-4e8e-818e-f57e8babb973_2360x970.png" width="728" height="299.22033898305085" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/53295717-4920-4e8e-818e-f57e8babb973_2360x970.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:970,&quot;width&quot;:2360,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Credit: https://aws.amazon.com/s3/storage-classes/intelligent-tiering/&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Credit: https://aws.amazon.com/s3/storage-classes/intelligent-tiering/" title="Credit: https://aws.amazon.com/s3/storage-classes/intelligent-tiering/" srcset="https://substackcdn.com/image/fetch/$s_!SZbd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53295717-4920-4e8e-818e-f57e8babb973_2360x970.png 424w, 
https://substackcdn.com/image/fetch/$s_!SZbd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53295717-4920-4e8e-818e-f57e8babb973_2360x970.png 848w, https://substackcdn.com/image/fetch/$s_!SZbd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53295717-4920-4e8e-818e-f57e8babb973_2360x970.png 1272w, https://substackcdn.com/image/fetch/$s_!SZbd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53295717-4920-4e8e-818e-f57e8babb973_2360x970.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Credit: <a href="https://aws.amazon.com/s3/storage-classes/intelligent-tiering/">https://aws.amazon.com/s3/storage-classes/intelligent-tiering/</a></figcaption></figure></div><h2>5. AWS Costs: Everything else &#8230;</h2><p>There are honestly many more ways to keep costs down, and I&#8217;d rather you walk away from this equipped with as many tactics as possible. So if you read this far, please accept this longer list of cost-saving optimizations:</p><ol><li><p>Use <a href="https://aws.amazon.com/ec2/spot/">Spot Instances</a> for fault-tolerant, low-impact workloads that can handle interruptions, and save up to 90% compared to On-Demand prices.</p></li><li><p>Use <a href="https://docs.aws.amazon.com/whitepapers/latest/cost-optimization-reservation-models/savings-plans.html">Savings Plans</a> to save up to 72% on compute by committing to a consistent amount of usage for a one- or three-year term.</p></li><li><p>Routinely check AWS Compute Optimizer for recommendations and <a href="https://aws.amazon.com/aws-cost-management/aws-cost-optimization/right-sizing/">right-size your instances</a> to fit the workload.</p></li><li><p>Avoid using CloudWatch for high-volume custom metrics, which are billed per metric&#8230; you&#8217;re often better off with open-source stacks like Prometheus + Grafana, or even a managed SaaS for observability.</p></li><li><p>Use Auto Scaling groups (ASGs) to scale in when demand drops, so you aren&#8217;t paying for idle resources.</p></li><li><p>Prefer container-based services like ECS or EKS over individual host/VM-based deployments. Containers make it easier to ensure full utilization of resources so you <em>get what you&#8217;re paying for</em>.</p></li><li><p>Add retention policies to your CloudWatch Log Groups. 
Don&#8217;t accidentally leave these with the default retention, <em>Never expire</em>!</p></li><li><p>Search for unattached EBS volumes (those in the &#8220;Available&#8221; state) and delete the ones you no longer need.</p></li><li><p>Consider putting an Amazon CloudFront distribution in front of your S3 bucket to provide caching and reduce data transfer costs or cross-region movement.</p></li><li><p>Remember that <strong>DynamoDB</strong> has a <a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.TableClasses.html">Standard-Infrequent Access</a> table class, similar to S3 Standard-Infrequent Access (S3 Standard-IA).</p></li></ol><h2>Before you go&#8230;</h2><p>I hope you find these tips useful for keeping your AWS costs lean, and that they enable you or your organization to do even more!</p><ul><li><p>Have a cost-saving tip of your own that I missed? Leave a comment and I&#8217;ll try it out!</p></li><li><p>Follow this newsletter and my <a href="https://www.linkedin.com/in/zcking/">LinkedIn</a> for more tech content, or find me on <a href="https://www.youtube.com/channel/UCyjgpEJIJbT7w7vFQ2fc4XA">YouTube @MakeWithData</a>!</p></li></ul>]]></content:encoded></item></channel></rss>