Research summary of insights for improving Kedro's value

Merel Theisen edited this page Mar 27, 2024 · 1 revision

Introduction

Our comprehensive analysis of people's motivations for using or not using Kedro draws on extensive user research and feedback gathered across multiple workstreams since 2019. We have synthesised key insights from surveys, interviews, support tickets, forum discussions, and usage metrics to understand our diverse users' needs. This GitHub issue compiles select representative quotes. This research supports the insight summaries in #2901.

This research, in part, adds definition to: https://github.com/kedro-org/kedro-viz/issues/1448

What is our supporting evidence?

IDE-focussed users want to adopt Kedro in an existing use case

Summary

IDE-focussed users will try to learn how to use Kedro by refactoring an existing use case into a project that uses Kedro. Their objective is to learn how to leverage Kedro as a framework or to adopt Kedro in stages by incorporating library components into their work.

Supporting quotes

Kimmo Sääskilahti: "I had some challenges to convert an existing project to Kedro. When you start with a starter, it's not super clear why things are there and what you need to add to your existing project." (link)

@ericmjl: "I'm wondering if it's possible to use Kedro with an existing project that was not created with kedro new? For example, I already have a project with a data/ directory, a src/ directory and more, and I'd like to start by only using the Pipelining capabilities. I've been unsuccessful on my first two attempts; I installed Kedro into the project conda environment, but the only commands available are docs, info, and new." #410

Chris Schopp: "While I would love to rewrite this project from scratch using Kedro, I'm trying to find the best way to adopt Kedro while retaining existing functionality. How exactly this should be done likely depends on how the original project is structured, but maybe there are some areas that would be common across all projects. I'm starting by building the data catalog for the final_data_set and working backwards from there. My first pipeline will be to concat data_set1, data_set2, ..., to create final_data_set. Next, I'll wrap my entire existing pipeline in a single node before gradually refactoring the existing code to use more nodes and pipelines." (link)

IDE-focussed users want to incorporate Kedro in an existing project template

Summary

IDE-focussed users leverage internal project templates built with CookieCutter, or tools that provide their own project templates, like Poetry, Hydra and DVC. This user group might bypass Kedro because of the high switching cost of adopting Kedro's project template or the challenges of integrating Kedro with those tools. We recommend that our users start from a Kedro project template or starter, but this may not be possible for them.

Supporting quotes

Khuyen Tran (on her article, "How to Structure a Data Science Project for Readability and Transparency"): "I’m aware of Kedro. I actually wrote an article about it. However, Kedro doesn't integrate very well with other MLOps tools so I prefer to use a flexible framework that allows me to use other tools." (link)

Kimmo Sääskilahti: "We have our own templates, and don't want to use kedro new. Kedro is important, but shouldn't dictate how the whole project is layed out." (link)

Kyle Haver: "That you should be doing kedro new at the beginning of a project has also been a challenge for some and their adoption of it ... We have a CookieCutter template for our package environments, similar to what the project output for Kedro would be. And they're just trying to import Kedro libraries to define pipelines in a main.py." (link)

@fmfreeze: "In organisations, there are often already established CookieCutter templates for their whatever data project. Kedro would be easier to integrate into those." #2553

@DataPsycho: "I always have to start with poetry first. Using poetry I have add kedro as a package for the virtual environment of the project. Then I am able to use kedro." #1722

IDE-focussed users want to choose the features included in the project template generated by Kedro

Summary

IDE-focussed users have a lot of opinions about how they want their project template to be structured, and there was a lot of variance in the views shared on #208. Even when an IDE-focussed user has committed to the project template created by Kedro, they still want the flexibility to choose which features are enabled and visible in their personalised template.

Supporting quotes

Lukas Keller: "Well, the one thing that intimidates me a bit is that it's a framework that is battery included... Django, you have a standard project structure in which you work into, and it has a bunch of things that it does for you. Whereas, in Flask, you write all the files yourself. And you know that there's only things there that you have put there." (link)

Felipe Vianna: "So let's say if you can customise the folder structure and pick the features that you want to use. Or maybe if you say, if you don't use these structures, you will not be able to use these features because they depend on that. That's fine. But if you use this whatever custom structure, you tell me this structure can still use this, this, and this features, I'll say okay that makes sense for me. Then it would be a way for me to really use it in practice, to deploy my projects on Kedro." (link)

@nraw: "Having an option of a full project vs core would be nice, so that the bloat is removed in case not needed, but it is kept for larger projects to still be standardised." #208

IDE- and notebook-focussed users will pass over Kedro for use on collaborative projects when they're the only ones that want it

Summary

Kedro is positioned as an all-or-nothing overhaul. Our users will choose not to use Kedro when they are placed on a collaborative project and are the only ones who want to use it. Most of these perspectives are associated with adopting the framework.

Supporting quotes

Mohammad Asad Ali: "Because trust me, even having this basic structure, sometimes in many projects it's not possible, because there are many different colleagues. There are many different styles of working... every single colleague I interact with, they have a different way of how they organise their code, and how they do different things." (link)

Daniella Mendoza: "I haven't used it yet because on the engagements that I was at, they already have something in place. So it was more disruptive to kind of move everything together rather than continue having what they have and just finish it and then whatever, what was new defined from the beginning. I think that could be one downside…" (link)

Jean-Claude Ton: "I'm currently on a new project. I'm working with different data scientists and they have a preferred way of structuring projects. They would like to continue working with that." (link)

Notebook-focussed users find our framework challenging to learn because we introduce software engineering concepts, and they are also not used to splitting a project into multiple files and directories

Summary

Our project template has a lot of software engineering concepts embedded in it, some more necessary than others. It is reasonable to expect that a notebook-focussed user, unfamiliar with this paradigm from software engineering frameworks, would need help understanding what each directory and file does - either by using our documentation or speaking to an expert user of Kedro. This user group also needed help understanding the role of configuration, and some preferred writing their code in a single file, a notebook.

Supporting quotes

Anonymous: "Part of the context is that I maybe fall slightly on the less technical side, so have been mainly working out of notebooks / relying on data engineers to productionalize code. So even the environment not being Jupyter Notebooks was a little intimidating :D that being said, I think the hardest part for me was understanding where I needed to go to make changes, and accessing the folder structure. I ended up clicking around a lot while I was looking for the right place." (Kedro User Research Study Survey)

Ashraf Miah (on a tutorial about Kedro): "It's a lot of boilerplate for what can be remarkably simpler analysis." (link)

Anonymous: "Overwhelming and intimidating number of files and directories generated just by starting a project!" (Kedro User Research Study Survey)

@WaylonWalker: "Currently, if we follow the kedro-pandas-iris starter format users need to touch at least three files to add new features (catalog, node, and pipeline). In my experience, this is so overwhelming for some that they step away and choose not to use the framework." #445

Lukas Keller: "I saw all of those files pop up. And I thought, "Okay, it's going to take me a while to dig through all of them, and figure out what they all mean and do." And client meeting is in an hour, I don't have the... Yeah, it's a bit tight." (link)

Sanjay Hariharan: "That's the main thing, it's moving from this scripting notebook approach to this more directory, software engineering approach… just changing the way that we think about how to run great data science code, I think that's tough." (link)

Kaan Karakeben: "There will be a learning curve and the curve is steeper if the user is coming primarily from using notebooks. There are Kedro specific concepts that may take time to grasp." (link)

Eduardo Blancas: "Expects your project to have a specific folder layout and configuration files. This is restrictive and an overkill for simple projects." (link)

Sanjay Hariharan: "So when starting with Kedro, you're faced with all of these directories. There's like eight directories, like src and conf. Just initially looking at that is confusing, like what is happening? What is conf, what goes in conf, what is happening? You have to read the docs to really understand." (link)

Shantam Saxena: "Financial analysts or economics statisticians, they kind of have a hard time wrapping their head around [Kedro]. Okay, what is the CLI and what are the parameters and what is this config file?" (link)

@ThomasLittrell1: "Totally agreed with @datajoely's suggestion about conf naming. I've had to explain to a bunch of people what base means." #208

Maria Olivia Linh (on creating a way to update parameters in a Databricks notebook and run a Kedro pipeline): "We created these four notebooks... You have to set the parameters here. [We didn't make the data scientists we were working with use parameters.yml] Because it was a bit harder to explain how to change the parameters. And especially because like the data scientist we were working with, she knew git a bit, but she worked with notebooks and she didn't work with an IDE... So when you open a YAML file, sometimes it can get messy. Like you don't have the tabs and then you're missing a tab and you don't know why everything explodes. And so it was easier having a cell where she does change." (link)

Debanjan Banerjee (on teaching others Kedro): "When we say setup.cfg, what would that have? pyproject.toml why do need that, etc.?" (link)

Antony Milne (back in the day): "It's quite daunting at first, all the different folders and files, do I really need all this stuff?. There's unnecessary boilerplate. I got this from people who did not know Kedro." (link)

Notebook-focussed and some IDE-focussed users don't know that they can use our Data Catalog; they think that using it requires a commitment to the framework

Summary

Users assume an all-or-nothing use of the Kedro framework and do not realise they can use the Data Catalog as a stand-alone component. Our documentation page for using Kedro as a data registry is a very unpopular page, and we also do not talk about this functionality at all with our users.

Supporting quotes

Jordi Smit: "It's important to note that the data catalog can only be used within a Kedro project due to its specific repository setup. Sadly, Kedro has a bit of learning curve." (link)

Felipe Vianna: "As I said, I like the structure, I like some of the features, and I like the Data Catalog... Then it would be a way for me to really use it in practice, to deploy my projects on Kedro." (link)

@nraw: "In general, I think kedro should try to be modular. It offers a lot, but maybe people want to buy only into a specific feature, like the catalog, without changing all of their practices." #208

IDE-focussed users leverage our Data Catalog to help notebook-focussed users or people who don't want to use our framework

Summary

Kedro's modular architecture provides opportunities to delight users by letting them adopt specific components, like the Data Catalog, incrementally. For example, IDE-focussed users have used the Catalog to empower analysts on their teams. Additionally, users who found the framework restrictive or just wanted to use Kedro for data exploration have benefited from the Data Catalog.

Supporting quotes

@bpmeek: "I have a library with several implementations of AbstractDataSet that I use to access proprietary data connectors at my employer, I would like to share this library with coworkers that are not using Kedro for use through the code API without the overhead of a full Kedro install." #2409

@bpmeek: "I use a lot of features of Kedro, but my coworkers could get benefit from having just the dataset feature. I could unify all of my companies legacy data connectors behind a single code API regardless of if they use Kedro or no." (link)

@Galileo-Galilei: "As the OP, I have a bucnh of use cases where I introduced kedro to data analysts who are not familiar with python. I introduced Kedro to them though the catalog and a notebook (with extra facilities to do SQL inside). They really enjoy the abstraction and the ability to use data from very differetn sources (old Access databases, files from s3 storage, parquet table from internal datalake, SAS for old datawarehouse... and save there results in excel). This avoids copying data and therefore increase security / data management / development speed. Kedro is a bit scary to them and I'd like to introduce the abstraction separately." #2409

Davide Ragazzon: "I only use the Catalog, and we decided not to use Kedro because it was too much structure on top to build and sometimes not flexible enough." (link)

Deepyaman Datta (back in the day): "I will use Kedro also just for some sort of data exploratory stuff ... Even though I'm not even building a pipeline or don't really care to have a new pipeline, I just use it because I can set up a data catalog, point it to some data, load up Jupyter Notebook with Kedro Jupyter Notebook." (link)

@anhoang: "I sometimes use Kedro nodes and pipelines in my projects, but I use Kedro's Data Catalog for all of my projects! It's so useful." (link)

@anhoang: "I'm only using kedro catalog to combine with Prefect and not using Kedro pipelines." (link)

IDE-focussed users work around our ConfigLoader's assumptions

Summary

IDE-focussed users run into errors because our ConfigLoader requires a conf directory, makes users place their configuration in conf/base and needs conf/local to be present. We expected our users to make ConfigLoaders without these assumptions, but we have yet to see evidence that they have done this. Our users choose to use other tools instead of our ConfigLoader or have workarounds for the errors that we create. We've assumed that users would always start from a Kedro project, and that's not always true.
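To make the directory assumption concrete, here is a minimal sketch of the kind of workaround this implies when converting an existing project: creating the expected folders up front, even if they stay empty. It assumes the default `conf/base` and `conf/local` layout described above. The helper name `ensure_conf_layout` is hypothetical and is not part of Kedro.

```python
from pathlib import Path


def ensure_conf_layout(project_root, environments=("base", "local")):
    """Create conf/<environment> directories that are missing.

    Hypothetical helper, not a Kedro API. Returns the directories it had
    to create, in the order given, so callers can see what was missing.
    """
    created = []
    for environment in environments:
        env_dir = Path(project_root) / "conf" / environment
        if not env_dir.is_dir():
            # parents=True also creates conf/ itself when it is absent.
            env_dir.mkdir(parents=True)
            created.append(env_dir)
    return created
```

Calling it a second time on the same project returns an empty list, because nothing is missing any more. That users need a step like this at all is the toil described in the quotes below.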

Supporting behaviour

Case A: This user suggests you should use Hydra to load configuration and uses all library components except our ConfigLoader.

Case B: These users just needed to load a single file, so they chose to import yaml, load the catalog.yml and use our Data Catalog.
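The approach in Case B can be sketched roughly as follows. This is an illustration under assumptions, not the users' actual code: it relies on PyYAML and on `DataCatalog.from_config` from `kedro.io`, and the function name `load_standalone_catalog`, the file name and the dataset name in the usage comment are hypothetical. Any dataset types named in the YAML file would also need their packages installed.

```python
from pathlib import Path


def load_standalone_catalog(catalog_path):
    """Build a DataCatalog from one YAML file, outside any Kedro project."""
    # Imported inside the function so that defining it does not require
    # Kedro or PyYAML; both are needed only when the function is called.
    import yaml
    from kedro.io import DataCatalog

    text = Path(catalog_path).read_text(encoding="utf-8")
    # An empty file parses to None, so fall back to an empty catalog.
    config = yaml.safe_load(text) or {}
    # No ConfigLoader, conf/ directory or project template is involved.
    return DataCatalog.from_config(config)


# Usage (hypothetical file and dataset names):
# catalog = load_standalone_catalog("catalog.yml")
# final_data_set = catalog.load("final_data_set")
```

The design choice here is to bypass configuration loading entirely, which is why these users never hit the `conf/base` and `conf/local` requirements.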

Supporting quotes

@eliorc: "I rather think this is a kedro problem, as in I have no need for the conf/local directory, as I have no reference to it anywhere in my code nor my catalog. Yet kedro still expects it to be there." (link)

@Galileo-Galilei (created a custom starter for his colleagues who are data analysts to use the ConfigLoader and Data Catalog): "I think that they would complain much more if they were to generate the structure manually and got weird config loader error messages if they forgot the nested subfolder."

Davide Ragazzon: "I didn't use the ConfigLoader to load the YAML file because then yeah. That's why, because I needed a Kedro project to have that, exactly that structure so that I knew where the config was loaded from. Yes. And it just has to be called config something or params file something and so on. And I just wanted to have a different file and then I just did it in another way." (link)

@ellwise (created a plugin called Kedro-Light and is trying to explain to users that they need to create a conf/local folder too): "Kedro expects you to have both a base and overriding set of configuration files, with the former in a folder named "base" and the latter in a folder that, by default, is expected to be named "local"." (link)

@astrojuanlu: "When converting an existing project to a Kedro one, this creates unnecessary toil, because it forces the user to create some directories and files even though initially they are empty or unused." #2593

@WaylonWalker: "With hooks out, such as kedro-wings, I no longer NEED a conf directory, especially for small projects. For simple pipelines, it seems a bit much to have so many files and directories and deters users from starting small projects with kedro. I would love to build out a starter without a conf directory, but kedro throws an error." #445

@Minyus: "I agree with @WaylonWalker. It would be more intuitive and reduce the learning cost for begineers to try Kedro if conf folder is optional rather than required." #445
