This study addresses the challenge of predicting the sensory appeal of tea from its chemical composition using machine learning. Because little data is available in this domain, I used data imputation and generation techniques: Kernel Density Estimation (KDE) to generate a training set and a Generative Adversarial Network (GAN) to impute sensory data. I developed and compared three models, a Random Forest, a Multilayer Perceptron (MLP), and a Recurrent Neural Network (RNN), to predict an aggregated 'Overall Sensory Score' from tea's chemical constituents. The models were evaluated by the mean squared error (MSE) of their predictions, and the predictions were visualized to offer insight into the relationship between chemical composition and sensory appeal. The trained models and data preprocessing scalers were exported to the ONNX format for deployment in a web application, so that the findings can be used in practice.
Green tea (Camellia sinensis), a beverage enjoyed globally, has an intricate chemical makeup that plays a crucial role in shaping its sensory characteristics. Despite its popularity, accurately predicting its sensory appeal from its chemical composition, including constituents such as polyphenols and caffeine, remains a significant challenge. This study addresses that challenge by using machine learning methods to develop a predictive model. Such a model could give tea manufacturers and tea drinkers insight into how individual chemical constituents affect the sensory experience of green tea.
The first step was to compile the dataset from several academic sources, focusing on tea catechin content and sensory evaluations. The preprocessing and enhancement steps below were then applied to this compiled dataset.
- Iterative Imputer: An iterative imputer was employed to handle missing data, ensuring the dataset's completeness and enhancing its quality for further processing.
- Kernel Density Estimation (KDE) for Synthetic Data Generation: KDE was used for the first round of synthetic data generation. A density was fitted to the limited dataset and new, plausible data points were sampled from it, so that the added points follow the distribution of the observed data.
- Generative Adversarial Network (GAN): A GAN, implemented in PyTorch, was used to generate sensory data. It filled in the missing sensory scores, with the aim of preserving the variability of the original sensory data.
- Second KDE for GAN Training Set: A second KDE-generated dataset was created specifically for training the GAN. This was needed to impute the missing sensory evaluation scores, because most source datasets mainly reported polyphenol and caffeine measurements.
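The imputation and first KDE step listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data are random stand-ins, the feature count, bandwidth, and sample size are assumptions, and scikit-learn's `IterativeImputer` and `KernelDensity` are used because the text names those techniques.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Hypothetical stand-in for the compiled dataset: 40 teas x 4 chemical
# features, with roughly 10% of the values missing.
X = rng.normal(loc=[60.0, 30.0, 15.0, 25.0], scale=[8.0, 5.0, 3.0, 4.0], size=(40, 4))
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

# Step 1: fill the gaps by modelling each feature from the other features.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_missing)

# Step 2: fit a Gaussian KDE to the completed rows and sample synthetic rows.
# The bandwidth is an illustrative guess; in practice it would be tuned.
kde = KernelDensity(kernel="gaussian", bandwidth=2.0).fit(X_filled)
X_synthetic = kde.sample(n_samples=500, random_state=0)

print(X_filled.shape, X_synthetic.shape)
```

Sampling from a Gaussian KDE amounts to picking an observed row and adding Gaussian noise scaled by the bandwidth, so a larger bandwidth produces more varied but less faithful synthetic rows.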
In this phase, individual sensory scores were aggregated into an 'Overall Sensory Score.' This simplification aimed to streamline the modeling process and provide a unified target variable for prediction, enhancing both the model building and the user interaction with the model.
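One simple way to form such an aggregate is an unweighted mean across the individual sensory attributes. The sketch below assumes that choice; the text does not state the exact aggregation rule, and the attribute names, scores, and the `overall_sensory_score` helper are hypothetical.

```python
import numpy as np

# Hypothetical sensory attributes, each scored on the same 0-10 scale.
attributes = ["aroma", "taste", "color", "aftertaste"]
scores = np.array([
    [7.0, 8.0, 6.0, 7.0],
    [5.0, 6.0, 7.0, 4.0],
    [9.0, 9.0, 8.0, 8.0],
])

def overall_sensory_score(scores, weights=None):
    """Collapse per-attribute scores (one row per tea) into one score per tea.

    With weights=None this is a plain mean; passing weights gives a
    weighted mean if some attributes should count more than others.
    """
    return np.average(scores, axis=1, weights=weights)

overall = overall_sensory_score(scores)
print(overall)
```

A mean only makes sense if all attributes share a scale, so attributes scored on different ranges would need rescaling first.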
To normalize the dataset and prepare it for model training, Min-Max scaling was applied separately to the chemical composition features and to the sensory scores. Scaling both to a common range keeps features with large numeric ranges from dominating training, and keeping the two scalers separate allows predicted scores to be converted back to the original scale.
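Min-Max scaling maps each column to the range [0, 1] via x' = (x - min) / (max - min). A minimal sketch with two separate scikit-learn `MinMaxScaler` objects follows; the numbers are made up for illustration, and the variable names are assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical chemical features (rows = teas) and overall sensory scores.
X = np.array([[50.0, 20.0], [70.0, 30.0], [60.0, 40.0]])
y = np.array([[5.5], [8.5], [7.0]])

# One scaler per block, so each can be inverted independently later.
x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()
X_scaled = x_scaler.fit_transform(X)
y_scaled = y_scaler.fit_transform(y)

# A prediction made in scaled space is mapped back to the original score range.
y_back = y_scaler.inverse_transform(y_scaled)

print(X_scaled)
print(y_scaled.ravel(), y_back.ravel())
```

Keeping the target scaler separate is what lets a deployed model report scores in the original units rather than in the [0, 1] range.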
Although PCA was used for data analysis, it was ultimately excluded from direct application in the user interface (UI) portion of the project. This decision was motivated by the desire to maintain the transparency and interpretability of the original features for end-users, ensuring that the information remained accessible and comprehensible.
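For context, an exploratory PCA of this kind might look like the sketch below. It assumes plain linear PCA from scikit-learn on synthetic stand-in data; the feature count and the latent-factor setup are illustrative and not taken from the project.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 5 chemical features driven by 2 latent factors plus noise.
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.8, 0.5, 0.0, 0.2],
                   [0.0, 0.3, 0.5, 1.0, 0.7]])
X = latent @ mixing + rng.normal(scale=0.05, size=(200, 5))

pca = PCA(n_components=2).fit(X)

# Share of total variance captured by each retained component.
explained = pca.explained_variance_ratio_
# Each row shows how one component mixes the original features.
loadings = pca.components_

print(explained, explained.sum())
```

The loadings illustrate the interpretability trade-off: each component blends several original features, which is harder to present to end-users than the raw chemical measurements.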
I developed and trained three different models:
- Random Forest: A robust ensemble method that averages many decision trees and can capture non-linear relationships.
- Multilayer Perceptron (MLP): A class of feedforward artificial neural network that can model complex relationships between inputs and outputs.
- Recurrent Neural Network (RNN): An advanced neural network architecture that can capture temporal dynamic behavior, suitable for sequence prediction tasks.
Each model's performance was evaluated based on mean squared error (MSE) between the predicted and actual sensory scores.
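A comparison of this kind can be sketched as below. The sketch is an assumption-laden stand-in: the data are synthetic, scikit-learn's `MLPRegressor` replaces the PyTorch MLP, the RNN is omitted for brevity, and the hyperparameters are illustrative rather than the ones used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical scaled chemical features and a non-linear sensory target.
X = rng.random((300, 4))
y = (0.5 * X[:, 0] + 0.3 * X[:, 1] ** 2 - 0.2 * X[:, 2] * X[:, 3]
     + rng.normal(scale=0.02, size=300))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "mlp": MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0),
}

# MSE = mean of squared differences between predicted and actual scores,
# computed on held-out data so the models are compared on unseen teas.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = mean_squared_error(y_test, model.predict(X_test))

for name, mse in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: MSE = {mse:.5f}")
```

Because MSE is computed on scaled scores here, its magnitude depends on the scaling; comparing models is meaningful only when all of them are scored on the same held-out set and scale.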
In the evaluation, the Multilayer Perceptron (MLP) delivered the most accurate predictions of the overall sensory score, closely followed by the Random Forest and then the Recurrent Neural Network (RNN). Visualizing the models' predictions made it easier to compare their performance and highlighted the complex relationship between the chemical composition of tea and its sensory appeal.
This study demonstrates the promise of machine learning techniques for forecasting the sensory appeal of tea from its chemical composition. The MLP model's consistent accuracy and near-perfect predictions on the evaluation data indicate that it is well suited to mapping the links between tea's chemical makeup and its sensory appeal.
This study introduces a methodology for predicting the sensory appeal of tea from its chemical composition. By addressing limited data availability through data imputation and generation techniques (iterative imputation, KDE, and a GAN), the approach offers a practical tool for the agricultural industry.
- Journal of Agricultural and Food Chemistry
- Antioxidants
- Journal of Food Composition and Analysis
- ICAR
- Journal of Chromatography A
- Journal of Agricultural and Food Chemistry
- Food Chemistry
- Nutritional Cancer
- Food Chemistry
- Journal of Dietary Supplements
- Phenol-Explorer
- Food Chemistry
- Journal of Food Science
- Food Science & Nutrition
- Molecules