Enhancing Data Pipelines with the Airflow SDK: A Comprehensive Overview


[00:00:02.210] - Speaker 2

Hi everyone. Good evening. Thank you very much for being here. This is the 10th London Airflow meetup. It was started by Kaxil, who is just to my left. He's one of the Airflow PMC members, one of the top committers of the project, together with Bolke, who was the second contributor to Apache Airflow, and Ash as well, who is another PMC member. It's a huge pleasure to have you here. We're still organizing the community. Before the pandemic we would have regular meetups; since the pandemic we had to switch to some online meetings, and now we are trying to get back on track. There's a chance the next meetup will be in May. If you have a venue and you would like to host, we would very much appreciate it. We would like to thank Evan over here: he has a company, Edworking, and his office is here in the accelerator. Thank you very much for making this space available. Well, our speaker today flew in from the Netherlands: Bolke. He's been contributing to Apache Airflow since 2015; he was the second person to contribute to the project, and since then he has come a long way. Now he's the VP of the project and a main contributor as well.


[00:01:29.390] - Speaker 2

So thank you very much for coming over. Thank you.


[00:01:35.630] - Speaker 1

Thank you, Tatiana, for having me. Hello, good evening. Before I start, I see some empty couches in front. I know you're very afraid of being in front, but I can offer you some free t-shirts if you come sit in front. You're going to be in front, right? You promise? Here you go. You have to be fast though, because there's little space left. Do you want to sit down next to Kaxil? It's fine. Or are you going to be shy? Go, free t-shirts. You're still here? Different sizes too, so don't worry about it. They are for everyone, of course they are for everyone. I need to make it a bit nice here as well; I like to be close to you. And today, yes, I'm going to talk about sexy data pipelines. Obviously a little bit of a teaser, but I hope you'll have fun and enjoy this talk with me. I'm not going to give an extremely technical talk, if that is what you expected, but everything that I'll be showing is going to be available to you. It's free, open source.


[00:02:45.500] - Speaker 1

I do work for a company called Astronomer, which you might have heard of through different Airflow meetups. But I've only worked there for the last one and a half years. Before that I worked for ING Bank for 14 years. I was the head of an analytics department of about 150 people, where we created data analytics platform products in wholesale banking and also retail banking. So I know what the use of the output of data pipelines is, and what putting that into a product looks like. But I also know it from the engineering side. I became involved with Airflow in 2015 because one of my data scientists was saying, yeah, we need something to schedule our workloads with. And I said, there's already something we could use for that. He said, no, that doesn't look cool, you need to look at that thing from Airbnb. So I started looking at that thing from Airbnb, looked at the security, and thought: shit. So let's fix that and let's change that. So I did kind of a pull request of about 50,000 lines changed. I think it was that; yeah, it was quite big. And Maxime, the guy who invented Airflow and now runs Superset, kind of merged it right away.


[00:03:59.850] - Speaker 1

And I was like, did you actually look at what I was writing? Because that was my first Python code ever. And then it went into production at Airbnb, and they were running at that time about 60,000 tasks per month. We were running ten per month. So that gives you an idea of where we were at compared to them. That's also an invitation to you, by the way, for the future, if you'd like to get involved. We are a very welcoming community in that way, and obviously your code will be scrutinized, but we try to look at it in a positive way, and at what itch you're trying to scratch and what problem you're trying to solve. But today I will be talking about sexy data pipelines, and I'll sprinkle in a little bit of Astronomer, as I explained. But like I said, everything is available to you. Around 6:40-ish, 6:45, I understood there might be pizzas arriving. I understand that you might be hungry. Don't hesitate to go to the back and grab one of those pizzas. If it becomes very much of a frenzy, which could happen, I will put a break inside my talk. Yeah, just put your hand up.


[00:05:09.920] - Speaker 1

If you feel like having a frenzy, it's okay, I'll immediately take a break. Yeah. All right, enjoy it, and let's say yes. Here we go. We are in need of a new data pipeline. So, yes, we go out dating. And typically you do that on a dating app nowadays, modern times, especially as we just came out of the pandemic and we don't go into restaurants anymore. So we'd like to go to Tinder for new data pipelines. In this case, I hope it's gender neutral, because we go both ways, right? Or any way, whatever you want. We are looking for a new data pipeline. We do not care where it resides or what it looks like, but we do like a sexy data pipeline. For the ones among you who are more English than I am: I assume it's the second definition that we're talking about, appealing, not eroticism, nothing like that. That's for later, your own dating, that's fine, but we're not there yet. First, we go for the appeal side. If you want to go for the first definition, be my guest, but not when I'm around; with a data pipeline, I'm not sure that's going to work.


[00:06:31.260] - Speaker 1

So it's about having an appealing data pipeline. And I think in that case, we're in for a bit of new things. But why do we need such sexy data pipelines? Why is that actually important? A little bit of theory. If you're in a company, you like to make data-driven decisions, because data-driven decisions are what gives a company competitive advantage. That means that those decisions need to be fed by data, and we need to transform the data into something meaningful before you can take that decision. And imagine you're working in a startup: then it's kind of easy to transfer the data that you're dealing with. When you obtain it, you can just walk to the other person and say, here, you've got the data, can you please explain what's in there? And then you're okay, right? That's not so hard. But if you go to a larger company, then suddenly the next team member, or the next team that needs to deal with the things that you're creating, is going to be across a different continent, maybe on the other side of the world.


[00:07:50.090] - Speaker 1

You can't even figure out a time zone or a time slot where you can actually meet and give that kind of explanation. They might not even know that you exist, which happens as well. So with the size of the company, the complexity of dealing with data increases. You can't just walk over to the other person anymore. But at the same time, a large company has more potential for taking data-driven decisions. Because in the front office of a bank where there are maybe 50,000 people working, and half of them are making decisions every day for their clients, that can be 25,000 decisions. If you're in a startup, that number is going to be smaller, because maybe you're with just ten people, maybe 20. So if you do it linearly, you end up with 20 decisions for the startup, or 10,000 to 25,000 for the large bank. But imagine if you could actually deliver those decisions back to your organization. Suddenly you can get an exponential effect. And that's what you would like to reach eventually by bringing data to the right place.


[00:09:09.180] - Speaker 1

But this means that you need to do a certain kind of data orchestration before you get there. Because if you do not orchestrate your data, by organizational and by technical means, the data won't arrive there, and you will actually end up in a negative way. And this is what we see with large companies: being slow to take the right decisions, or not taking the right decisions at all, or taking the wrong decisions. I don't know if you've ever dealt with an energy utility company that gave you the wrong bill, or where you needed to get a refund for something. Those are often also the result of a wrong decision being taken. I see the pizzas have already arrived. My sense of smell is a little bit gone. If you feel the need to go to the back, you're very welcome to; take it easy, don't trip over each other. It's not too eager yet. That's kind of impressive. Thank you. So the idea is actually to get to exponential decision making. And our premise, my premise, is basically that one way to do this is to have proper data orchestration available within your company, so that you're bringing the data in an efficient and effective way to where those decisions are being taken.


[00:10:28.460] - Speaker 1

And that doesn't mean that it's just an analyst in the front office taking that decision. It can be the CEO; it can even be the products where automated decision making happens. What makes a data pipeline sexy? Well, this is the wish list. Obviously we're on the dating app, so we kind of have in our minds what we match against with our potential partners. So yeah, like any partner, we want low maintenance. We don't want to be calling each other the whole time asking, is something wrong? No, we don't want that. So we'd like to have a low-maintenance data pipeline. We obviously also want it to do the hard work for us. Maybe at home you want to have the toilet cleaned, you want to make sure that the shower is also clean and your clothes are folded, and preferably that dinner is ready at the end of the day, because then your stomach is filled and you're going to be very happy. The same thing goes a little bit for a pipeline. You want it to do the hard work for you; you don't want to do transformations yourself. Obviously you want it to help you where required.


[00:11:48.450] - Speaker 1

And obviously you want to look a little bit cool. That's the status symbol, that new data pipeline. If that data pipeline next to you looks really cool, then that shines on you as well. So yeah, it needs to make you look cool too. Obviously, don't get in my way if I need my private space: if I have my private space when I require it, don't bother me at that moment. And eventually it should actually also solve all my problems. Let's see how far we can get. Let's swipe. Nice. So let's start swiping, obviously, to get to the right partner. This is the anatomy of a pipeline. There are some places where it says there's a DAG involved, maybe for very technical people; others won't actually know what a DAG is. It's called a directed acyclic graph, if you're into Airflow. And I don't know how many people here actually use Airflow? I forgot to ask. Can we raise some hands there? That's pretty impressive. I was afraid that it was going to be very unknown, from what I understood. Okay, so we're fine; you know what a DAG is. It does not sound very sexy, though.


[00:13:08.730] - Speaker 1

I was afraid of that. And although it didn't look like it, it's probably something that you know, but it doesn't feel entirely right anymore, I would say. If you're a data scientist and you're being asked to actually write a DAG, you think: why would I do that? Please let me train my model, create my model. I really don't want to create a DAG. What the hell, by the way, is a DAG? I don't want to do that. I just need data to arrive in time for what I'm doing. So no, this is absolutely not so sexy. And I have to admit I took this picture from one of my colleagues, also promoting our product in another way. So I imagine, I think we can uplift this a little bit, and we're doing so now. We put the X button there; we definitely don't like that. Let's swipe to the next one. Well, this one fortunately didn't die, and it says: please fill in your question or ask your question. I love that kind of interaction. They're asking me a question for a change. No opening lines, right? So I say, hey, you're asking me a question.


[00:14:21.740] - Speaker 1

Great stuff. I like your profile text. Here you go. That's why I got a little bit further with this one. Hey, you can actually ask ChatGPT to write you an Airflow DAG if you want, right? I mean, easy enough. So you can ask it: write a DAG that queries a Snowflake database and puts the results into an S3 bucket. And it actually works. It's not a problem; you can actually get this running. Obviously you want to fill in a few other things maybe, but it gives you an example DAG that you can work from, and the code is correct. However, maybe you've been on one of the dating apps: did you ever find that the person you were meeting was not actually the age you were expecting? That's what you're getting with ChatGPT. It looks kind of old, because ChatGPT was only indexed until somewhere halfway through 2021, or maybe the end of 2021, and we've moved on from this. You don't have to do this stuff anymore. You can do it in a more modern way, where there's less code involved.
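To make that concrete, here is a minimal sketch of the kind of "classic" DAG ChatGPT tends to produce for that prompt: query Snowflake, dump the result to S3. The connection IDs, query, bucket and key names are illustrative assumptions, not something shown in the talk.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def snowflake_to_s3():
    # Pull rows from Snowflake into a pandas DataFrame...
    df = SnowflakeHook(snowflake_conn_id="snowflake_default").get_pandas_df(
        "SELECT * FROM my_schema.orders"
    )
    # ...and write them to S3 as a CSV object.
    S3Hook(aws_conn_id="aws_default").load_string(
        df.to_csv(index=False),
        key="exports/orders.csv",
        bucket_name="my-bucket",
        replace=True,
    )


with DAG(
    dag_id="snowflake_to_s3_classic",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="snowflake_to_s3", python_callable=snowflake_to_s3)
```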


[00:15:36.480] - Speaker 1

Maybe you don't even need this stuff nowadays: you have the TaskFlow API, if you know it. You can just decorate your task. If you're a data scientist, you would love that, because you just decorate your function and then things start working. But that means that ChatGPT is a little bit older than I expected, unfortunately. So, on the wish list, what marks do we get? Low maintenance? No, I definitely need to adjust it to modern times; I need to update this data pipeline to something that is workable in modern times. It does do the hard work for me, and I'm quite used to it in that way too. Maybe I find that acceptable, but the age gap seems a little bit large. It still doesn't make me look cool. I don't know about you, but everybody's using ChatGPT nowadays. It gets kind of boring, and people are being called out for it too, so it's not so original anymore. So, cool? I don't know. Doesn't get in my way? I don't know how often you've tried ChatGPT recently; it also locks me out because there are too many people.
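As a reference point, here is a minimal TaskFlow sketch: you decorate plain Python functions, and Airflow turns them into tasks and infers the dependency from the function calls. The DAG name and toy logic are illustrative, not from the talk.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)
def taskflow_example():
    @task
    def extract():
        # Pretend this pulls data from somewhere
        return [1, 2, 3]

    @task
    def summarize(values):
        print(f"sum={sum(values)}")

    # Dependency extract -> summarize is inferred from the function call
    summarize(extract())


taskflow_example()
```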


[00:16:49.140] - Speaker 1

So it does get in my way. And does it solve all my problems? I don't think so, because that first thing was still a problem for me, and otherwise I need to start paying, which sounds really weird in a dating context. So I still think it's an X. Then we have something called the Astro SDK. Oh, it's such a crappy name, and if you come up with a better name, please come over. But yeah, sometimes names are a bit deceiving that way. It likes ducks, though. And do you know what DuckDB is? Do you know SQLite? Yeah? All right, think SQLite, but then for online analytical processing, in process. That's DuckDB. And DuckDB is awesome because it allows you to do online analytical processing for large data sets on your small computer. They basically decided that 90% of all the big data use cases are not big data use cases; they are actually things that fit on your computer, definitely nowadays. So DuckDB is doing this on sexy hardware, like your Mac, or your ThinkPad.
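For anyone who hasn't tried it, here is a tiny sketch of what in-process analytics with DuckDB looks like: SQL over a local Parquet file, no server involved. The file and column names are illustrative assumptions.

```python
import duckdb

con = duckdb.connect()  # in-memory database, lives inside your Python process
result = con.execute(
    """
    SELECT origin, count(*) AS flights
    FROM 'flights.parquet'      -- DuckDB can query Parquet/CSV files directly
    GROUP BY origin
    ORDER BY flights DESC
    LIMIT 5
    """
).df()  # results come straight back as a pandas DataFrame
print(result)
```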


[00:18:09.920] - Speaker 1

That guy over there is doing it on a ThinkPad. In the meantime, if you think it solves everything for you, then DuckDB is your thing. And the great thing is, yes, the SDK actually supports it. It also loves pandas, for some reason. I love cuddly bears, but this one loves pandas too. And do you know what? It's poly-cloud. That means that you can take it everywhere, and it's very modern, I would say, to be poly; being poly-cloud maybe sounds even better. So yes, I think we should dive in a little bit further. So, the target for the Astro SDK — and I would actually ask you to tell me it's a bad name, so I can give that back to our product management and they can come up with a better name, because I keep stumbling over this Astro SDK thing. Nevertheless, it's something that you run on top of Airflow. And it's available for free, by the way; like I said, everything we show here today is entirely open source, we don't keep things back. The target is roughly an 80% reduction of the code that you're writing for your pipeline. Sounds reasonably sexy.


[00:19:28.120] - Speaker 1

At least you get more time to do other stuff. That is nice. And it is authoring reinvented, not just for data engineering, which is what Airflow typically was for, but geared more towards data scientists as well, because they like to write models. Are there any data scientists in the room? It's quiet. I thought it was one of the sexiest jobs you could get. Who considers himself or herself a data engineer? That's pretty big. Software engineering? Management? Okay. All right. For the nonexistent data scientists in the room, we also wrote code for you to make your life easier. Maybe then you would like to be at an Airflow meetup eventually in the future. Spread it around to your data scientist friends, please. The idea behind it is to write self-documented, Pythonic code, which I would say Airflow DAGs typically are not, and to be able to move between relational sources and Python data structures without effort, and to load and export files to and from local and remote stores. Now, imagine I am a data scientist, and there's a use case on the marketing team who wants to perform a sentiment analysis of Reddit posts about Airflow.


[00:21:05.510] - Speaker 1

I need to get the raw data from the Reddit API, put it somewhere, clean it, pass it through the inference model and merge the results into the reporting table. I will not give you the entire example, because that would be a little bit elaborate for this talk, but it's available to you; we'll show it on the GitHub page. We're there, right? Yeah, we're good. All right. So the use case is extract, filter, analyze, join, load. Kind of a typical way, actually, for a data scientist to work. For the analyze and the join part, you could also say that's training a model, and having a model can eventually replace that bit; and merging that into a feature table is also something which would typically fit a data scientist's workflow. You don't believe me just like that? Just give me an example before we go on a date, maybe. All right, so this is standard Airflow. You saw it maybe in the ChatGPT example already: we have that Snowflake operator that creates a table, a temporary table. So it kind of feels like, okay, why should I create that boilerplate, that temporary table? You need to think about all those steps that are not really tied to the thing that you need.


[00:22:39.350] - Speaker 1

Right? I mean, a temporary table is something you use as an in-between step. It's not necessarily the end result you're trying to get, or anything that you truly need to get to your end result. You're basically doing it to solve a problem in your infrastructure rather than for the things that you need to do later. And then, and it's a simple one, you have a transfer operator that actually gets the data into a staging table and fills the temporary table, and you specify the task dependencies at the end. I don't know about you, but it feels tedious to me. At the Airflow Summit, which will happen somewhere in September this year in Toronto, I will do a talk called Operators Need to Die. I already get a lot of people saying: why are you going to say that? I don't agree with that. I say: show up at my talk and you're definitely going to agree with it, I think. And if you don't, then we will still be happy with a full room of people. Nevertheless, it just feels a little tedious, especially after all those years since I joined the Airflow movement, that we're still writing this type of code.


[00:23:52.850] - Speaker 1

But I think the world has moved on now. Grab a pizza, it's for you, you're welcome. Here we go. If you're hungry, that's good, and the pizzas will get cold, by the way. Do you want me to take a break so you can grab a pizza, or are you fine? You're fine having a bit of cold pizza? Here we go, grab the pizza, please, five minutes or so. Don't forget about the t-shirts, by the way; they're also up for grabs. We have a new logo nowadays too, so they're vintage. They're over by the pizza place, so I hope you can distinguish between the two. By the way, a big round of applause for Tatiana, because she's been organizing this and I think she wants to host more of them, so she needs a lot of applause for this kind of thing. All right, I'll continue and let you continue eating; I can do this. Right, so I explained there's a lot of code involved with that part, and look what happened here: all that code went down to one line, basically. Yeah, we split it across four, but that's more for readability.


[00:25:42.330] - Speaker 1

The other one was also split across multiple lines, but this is that code captured in this one line. So there's a lot less room to make mistakes; there's a lot less room for error where your thinking is not entirely aligned with what you're trying to accomplish. Somebody at another meetup actually mentioned that this does hide a couple of other things. Yes, that's correct, because if you do want to create a temporary table, all those kinds of things, or you want to deviate from what is being supplied, then this doesn't fit. But we don't stop you from using the other way; that's still fine. But for many of the use cases, 80% of the use cases will be covered by this kind of thing: hey, I need to load some data, move it from some place to somewhere in there. And then this can actually work: get a file, load it into a table. Sounds kind of nice. I would argue it makes my life, or at least would make it, a little bit easier. So that's an 80% reduction of code, I would argue, in comparison with the other examples being shown.
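For readers following along, here is a minimal sketch of what that one-liner looks like with the Astro SDK's aql.load_file: a file goes straight into a warehouse table, with the temp-table and transfer plumbing handled for you. The path, connection IDs and table name are illustrative assumptions, and import paths may differ slightly between SDK versions.

```python
from datetime import datetime

from airflow.decorators import dag
from astro import sql as aql
from astro.files import File
from astro.table import Table


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)
def load_file_example():
    # One call replaces the create-temp-table / transfer / set-dependencies dance
    aql.load_file(
        input_file=File(path="s3://my-bucket/raw/flights.csv", conn_id="aws_default"),
        output_table=Table(name="flights_staging", conn_id="snowflake_default"),
    )


load_file_example()
```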


[00:27:06.390] - Speaker 1

So, going back to the wish list. Is it low maintenance? Yeah, absolutely. Less code, less room for errors. That means my testing code is going to be smaller, and I'm being helped by this framework. And it's not a major effort, because if I need to change something, I just change one thing instead of many things. I don't have to create a temporary table and maybe make a typo there, or change it in the future. It does all the hard work for me as well. Yes, it created that temporary table for you in the back end; it reached out to the original source, actually pulled it into a table, and inferred all the columns that needed to be created. Does it make you look cool? Well, in data engineering land it makes you look cool. Right, I mean, there could be cooler things, but in data engineering land it makes you look cool. Or you could see it from the positive side: because it reduces the time that you spend on it, you have more time to go to a pub later on. You're invited to go to a pub with us later tonight as well, to talk over there.


[00:28:12.640] - Speaker 1

We still need to figure out the actual place, so if anyone has suggestions: I'm not from around here, I'm not local, so I don't know where to go. Does it get in my way? No, it doesn't get in your way, because you're still able to go back to the other way of doing things. So we're still flexible. This is what Airflow is known for: extremely flexible. Sometimes overly flexible, right, Ash? Yeah, maybe a bit. Sometimes we're overly flexible, but the flexibility really remains. You can still fall back to whatever you require on the back end. Does it solve all my problems? Probably not, but I guess that's asking too much, because you can't expect that from your partner. You have to fix your own problems, I would say; that's not for somebody else to solve. So, getting back to the profile: yes. Do we like ducks? Yes. Do we like the Astro SDK? It's okay, apart from the name. Quick suggestion: we should have a post box here, I guess, with suggestions for a better name. The poly part: in this case, we like it, we're up for that.


[00:29:30.470] - Speaker 1

We're modern. So yes, let's push the love button, and guess what? It's a match. Great stuff. But now, what would you do next? Very scary. You want to go on an actual date, obviously. So, the date: that's the important one. How do you actually try this out? Now you want to go on a date, but that new thing lives kind of far away, so you decide to hop on a flight. But obviously, being data driven as you are, you would like to know what the likelihood is of you making it to that actual date and not ending up with a canceled flight. Because that doesn't help, right? You definitely find the new data pipeline appealing, and hopefully it's mutual. So yeah, you need to do something about that. So here I have a demo, and I'll share it later with you; you don't have to do it live. If you'd like to try this out at home, it's right here too. That's also allowed; you can try it at home too.


[00:31:02.130] - Speaker 1

It's also allowed, no problem; nobody is going to be alerted. You just need a Docker installation and the Astro CLI. The Astro CLI is nothing special in the sense that we don't do any weird stuff that you might be afraid of; we just make your life a little bit easier in getting Airflow up and running, a little bit faster than maybe the typical pip install apache-airflow and then getting everything running. It sets up the databases for you, all those kinds of things, so that makes your life a little bit easier. And then you clone this repository. That's my dear colleague Jack; he was so nice to help me out getting to this date. So you go to the Airflow flight demo and you start up the local Airflow services. Not so hard. Go to the project, open up the Airflow UI, dive in and start running a couple of things. The project setup DAG is the first one. So if you read the material later on, just follow the instructions; it's quite easy. I know us guys don't typically read the manual. Please do in this case and follow the steps.


[00:32:18.960] - Speaker 1

It's quite easy. Then turn on the project setup first and wait till it finishes. What it does is create a couple of tables in the back end, all that kind of thing, and set up the databases for you. Hey, you don't have to do it yourself; it's fully automated. Awesome. And then you turn on all the others afterwards, because it will trigger the pulling in of the data sets that are going to be required, and it will actually also train the model for you. I will show you in a minute how it works; I need to get my computer to do it. As it is, it only needs to be run once, and you can see it following along in the grid view, the new awesome grid view that one of our colleagues created. He's updating it right now, too, to make it even nicer. So it should show green. That would be nice. If it's not green, yeah, you need to dig in a little bit; I can't debug it from here for you, but in principle it should all just work. Then enable auto-refresh, which will actually allow you to see the status updates as it completes.


[00:33:28.600] - Speaker 1

And then you turn on all the other ones, and everything else will start running for you. Basically, the first one you need to trigger by hand, and then that one will ultimately cascade through the rest. Nice one. This is the output that we expect: all green. With the new UI it will look even sexier, I would say. And I will just show you in a second what that looks like in real life, whether we actually get to our data or not.


[00:34:09.490] - Speaker 3

Say it.


[00:34:11.110] - Speaker 1

No, I'm okay. As you've seen here, this is the exact same screen as I showed you a minute ago in the presentation. What I can do, and I can rerun it if I want to, is check for new files again, and it will start working again. Hold on, let me fix that for a second; I can run it from a place where it exists. There you go, it starts running. It should, in this case, be very quick, because it won't find any new files, since I did run it before. What we have in this case is an open data set with all the flights in the US. It's only the US, but it's all of the US. And hey, in this case, we are flying to the US. It has all the flights captured since, I think, 1987, which you can use for analysis, and it will show you whether a flight was canceled or not. In this case, we are going to use that to train a model and to predict whether the flight that we're taking today is going to be canceled or not.


[00:35:35.820] - Speaker 1

And that will then signal where we could meet, at least as a proxy for that, obviously. So here we have quite a list of where those data sets are. I put them in MinIO. The Docker image from the GitHub page will also pull in MinIO for you, so you don't need cloud access or anything else. MinIO is S3-compatible storage that runs on your localhost. It does everything that S3 allows you to do, or at least the 80% that's sufficient for what we do. It also works with all the S3 clients, including the Amazon one, if you'd like. That makes it a little bit easier, because if you're doing this from the host and a client assumes S3 means Amazon, you typically need to do certain things to make it work with MinIO. Hopefully in this case it's already finished. As you can see, it has finished, it's done running. And then you would like to load the different files into the table. We can trigger that as well. Trigger the DAG and it starts running. It could have done it automatically, but as there were no new files, the data set would not have been updated.


[00:36:57.560] - Speaker 1

This is something which is new in recent Airflow versions: what you would call data set-triggered DAGs. Data pipelines have changed, become more sexy, in that you now have data set-triggered ones. That means that if there's a change in an originating data set, maybe somewhere in an S3 bucket or somewhere else, that actually signals that Airflow should start working on its workflow. So that's a little bit different from constantly polling, which is what we typically did, or having a sensor in place. Now you can trigger it the other way around: when there's a change, multiple DAGs can even start running from there, which is more natural to what you would assume, and it's not so dependent anymore. You can do event-based triggering rather than poll-based triggering.
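As an illustration, here is a minimal sketch of data set-triggered scheduling as it works in Airflow 2.4 and later: one DAG declares an outlet Dataset, and a downstream DAG is scheduled on that Dataset instead of polling or using a sensor. The URI and DAG names are illustrative assumptions, not the demo's actual code.

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

flights_raw = Dataset("s3://flight-data/raw/")


@dag(start_date=datetime(2023, 1, 1), schedule="@daily", catchup=False)
def producer():
    @task(outlets=[flights_raw])
    def ingest():
        ...  # write new files; the Dataset is marked as updated on success

    ingest()


@dag(start_date=datetime(2023, 1, 1), schedule=[flights_raw], catchup=False)
def consumer():
    @task
    def load_to_table():
        ...  # runs whenever the producer updates the Dataset

    load_to_table()


producer()
consumer()
```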


[00:37:56.100] - Speaker 3

In this case.


[00:38:00.260] - Speaker 1

This has run, or maybe it hasn't run, but it's okay, because it has already loaded all the things that we require and got the training data in there. You can get an overview here in the graph view. As you might have seen, you do a bit of cleanup first. We create a training table, get the canceled flights and append the ones that are there, then get the normal flights and append those. And then we trigger the downstream one, which basically means changing or updating the data set. For this one we train the model, which is just a boosting model, for those who do a little data science, and then we display the model. The test task here was actually a test for DuckDB in this case, but that's not relevant for this one; it won't show up in your installation when you try this out. Finally we can go to the generated cancellation prediction, which at the moment is not in the final web page; it's not in the installation yet. It basically just displays the table. So the table is available. We're integrating this one because we were playing around with it too, and we want to make it nice for you and have an end-to-end example, right?
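For the "train the model" step, here is a hedged sketch of what a boosting-model training task could look like, assuming XGBoost; the function name, column names and feature set are illustrative assumptions rather than the demo's actual schema.

```python
import pandas as pd
from xgboost import XGBClassifier


def train_cancellation_model(training_df: pd.DataFrame) -> XGBClassifier:
    # Assumed schema: a few numeric features plus a 0/1 'cancelled' label
    features = training_df[["day_of_month", "origin_airport_id", "dest_airport_id"]]
    labels = training_df["cancelled"]  # 1 = cancelled, 0 = flew
    model = XGBClassifier(n_estimators=100, max_depth=4)
    model.fit(features, labels)
    return model
```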


[00:39:28.050] - Speaker 1

So we sometimes don't have everything available right away; we'll try to make it available to you as soon as possible. But here you can actually see: there's the day of the month, which is the 27th, today; we have a cancellation prediction, and we have an origin city name and a destination city. So if we're flying out from New York to, for example, Chicago, there is a likelihood that our flight is going to be canceled. And hey, we're definitely in New York at this moment, as you can see. So I don't think it's smart to meet our new date, our new data pipeline, in Chicago. But a lot of the other places are fine, and we can go to Detroit as well. In this way I've tried to show you that you can do an end-to-end data science project basically within Airflow, with data set triggers and with a new way of dealing with data loading and transformation. There's more to the Astro SDK; we solve a lot of other things for you as well. You can move from tables to data frames, for example, and vice versa, just like that.
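Here is a minimal sketch of that table-to-DataFrame hand-off in the Astro SDK: a function decorated with aql.dataframe receives the upstream warehouse table as a pandas DataFrame, and its return value can land in a new table. The table names, connection ID and toy sentiment logic are illustrative assumptions, and import paths may vary between SDK versions.

```python
from datetime import datetime

import pandas as pd
from airflow.decorators import dag
from astro import sql as aql
from astro.table import Table


@aql.dataframe
def score_posts(posts: pd.DataFrame) -> pd.DataFrame:
    # 'posts' arrives as a DataFrame even though upstream it lives in a warehouse table
    posts["sentiment"] = posts["text"].str.contains("love").astype(int)  # toy stand-in model
    return posts


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)
def table_to_dataframe_example():
    # Read from a table, compute in pandas, land the result in a new table
    score_posts(
        posts=Table(name="reddit_posts", conn_id="snowflake_default"),
        output_table=Table(name="reddit_posts_scored", conn_id="snowflake_default"),
    )


table_to_dataframe_example()
```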


[00:40:39.880] - Speaker 1

You can move from BigQuery to Snowflake just like that, if you like. These things are handled transparently for you. Again, it's open source, so we also accept contributions if you find any bugs, because obviously there sometimes are. But on the other hand, hopefully this makes your life easier and a little bit more productive, so you have more time to have beers at the bar. So thank you very much for having me here today. In case you'd like to get in touch: bolke at astronomer.io, so it's my first name at astronomer.io; it's quite easy to remember. You can also reach me at my apache.org address if you want. If you think Astronomer is too long, I also have a freedom.nl address; that works too. Bolke is not such a well-known name, even in the Netherlands: there are six of us on this planet, and I've met two of them. They were not very surprised; for some reason I was, but they were not surprised. But that's where you can find me. In the meantime, I also have to say: where is Kaxil? Oh, there he is. Kaxil is hiring a software engineer to make this even better, in particular more in this area of the code.


[00:42:09.890] - Speaker 1

So if you're interested in working for us, please reach out to him. He's a really nice guy; he doesn't make your life hard at all. No hard test, just walk in and you're hired. Just kidding. But I think it's worthwhile talking to him if you're interested in joining us and making the lives of data people a little bit easier. Thanks so much. And I can take questions as well, obviously, if there are any. The question is: with the SDK, when you write the code, does it help you as a module, or at the back end is the Airflow code the same as the code I showed before? So the SDK — a better description is maybe that it's more like a framework on top of Airflow, so it abstracts away certain things for you, which are typical steps that you would take, and that's what it does. Everything it does runs directly on top of Airflow, and you could replace it with Airflow code if you'd like to. And to your follow-up: yes, if you look at the Airflow code underneath, it will be the same as the first example I showed; it's in the module, that's why. Yeah, absolutely.


[00:43:40.400] - Speaker 1

Any other questions? It's very quiet. Maybe we should drink a few more beers and then ask the questions, right? Be a little bit less shy.


[00:43:54.540] - Speaker 3

Is the SDK modular? If there is functionality that you want to implement yourself — like most of the stuff in Airflow, you can basically override things in order to implement your own stuff — can you do that with the SDK, or is it closed?


[00:44:14.260] - Speaker 1

What's your exact question? Because I think you're asking multiple things.


[00:44:16.990] - Speaker 3

Yeah, I mean the SDK: if, for example, it doesn't have a piece of functionality, can you build your own functionality to work with the SDK, or is it something that you build and we just use?


[00:44:29.320] - Speaker 1

So, if I got your question right: can you basically build on top of the SDK and use it in any way you want? More or less, yes, you can. I mean, it's an Apache 2.0-licensed project; it's open source. We've developed it outside of the Airflow ecosystem so that we could play around a little bit more with what direction it should take. Our intention eventually is to bring it into Airflow, but we're also changing certain paradigms, which some people might not like. I mean, we're a bit more opinionated about how you do certain things in the SDK than Airflow is. So if we can marry those two, that would be perfect, because that would also make this bigger, obviously, and Airflow itself has a bigger audience than the SDK currently has. So on your first point: the SDK is open, but it's less popular than Airflow is, and we try to get it more popular because we think it's of added value to the ecosystem. Any other questions? Anyone grabbed a t-shirt yet? No? Like I said, they're up for grabs.


[00:45:47.180] - Speaker 1

They're going to be gone soon. Thank you very much. I hope you enjoyed it. Like I said, don't hesitate to reach out later on. A big applause for Kaxil and for Tatiana.


[00:46:04.520] - Speaker 2

Hi everyone. So, some of the people who are here are going to a pub afterwards. We know it's Monday, but we're going there for a little bit. It's a brewhouse and kitchen, and it should be around two to three hundred meters from here. We will have Bolke, Ash, Kaxil and myself, and we hope more community members can join. I added the details in the chat on the meetup page. Also, I brought the t-shirts from home; I would really appreciate it if you could take some. We have small, medium, large, XL, XXL. Please help yourself, and see you at the next meetup. Thank you.



Summary
In this podcast, the speakers discuss the capabilities and benefits of using Apache Airflow and the Airflow SDK for managing data workflows. They provide a comprehensive walkthrough of setting up, configuring, and executing end-to-end data science projects with Airflow. The discussion highlights how the SDK serves as a framework on top of Airflow, abstracting away certain routine steps and making it easier to work with various data sources and formats. The open-source nature of the SDK allows for customization and extension, enabling users to build their own functionality as needed. The podcast also touches upon the potential integration of the SDK with the broader Airflow ecosystem and invites the audience to contribute to its growth and development.
