Hi! We’re currently researching code smells in machine learning projects in an industry context and are looking for feedback on dslinter! It would be a massive help if you could run dslinter on your machine learning project in an industry setting and send the generated txt file to [email protected]. The steps and commands can be found here, and the process should take no more than 10 minutes. Feel free to send me an email if you want to go through the process together. The process is anonymous, and we will remove any sensitive information before the results are published. Many thanks!
dslinter is a PyLint plugin for linting data science and machine learning code. It aims to help developers ensure the quality of machine learning code and supports the following Python libraries: TensorFlow, PyTorch, Scikit-Learn, Pandas, NumPy and SciPy.
dslinter implements detection rules for the smells identified in our previous work. The smells were collected from papers, grey literature, GitHub commits, and Stack Overflow posts. They are also described in more detail on a website :)
demo.mov
The example project in the demo video can be found here.
To install from the Python Package Index:
pip install dslinter
pylint --load-plugins=dslinter <other_options> <path_to_sources>
Or place a .pylintrc configuration file containing the above settings in the folder where you run the command, and run:
pylint <path_to_sources>
[For Linux/Mac OS Users]:
pylint \
--load-plugins=dslinter \
--disable=all \
--enable=import,unnecessary-iteration-pandas,unnecessary-iteration-tensorflow,\
nan-numpy,chain-indexing-pandas,\
merge-parameter-pandas,\
dataframe-conversion-pandas,scaler-missing-scikitlearn,hyperparameters-scikitlearn,\
hyperparameters-tensorflow,hyperparameters-pytorch,memory-release-tensorflow,\
deterministic-pytorch,randomness-control-numpy,randomness-control-scikitlearn,\
randomness-control-tensorflow,randomness-control-pytorch,randomness-control-dataloader-pytorch,\
missing-mask-tensorflow,missing-mask-pytorch,tensor-array-tensorflow,\
forward-pytorch,pipeline-not-used-scikitlearn,\
dependent-threshold-scikitlearn,dependent-threshold-tensorflow,dependent-threshold-pytorch \
--output-format=text:report.txt,colorized \
--reports=y \
<path_to_sources>
[For Windows Users]:
pylint --load-plugins=dslinter --disable=all --enable=import,unnecessary-iteration-pandas,unnecessary-iteration-tensorflow,nan-numpy,chain-indexing-pandas,merge-parameter-pandas,dataframe-conversion-pandas,scaler-missing-scikitlearn,hyperparameters-scikitlearn,hyperparameters-tensorflow,hyperparameters-pytorch,memory-release-tensorflow,deterministic-pytorch,randomness-control-numpy,randomness-control-scikitlearn,randomness-control-tensorflow,randomness-control-pytorch,randomness-control-dataloader-pytorch,missing-mask-tensorflow,missing-mask-pytorch,tensor-array-tensorflow,forward-pytorch,pipeline-not-used-scikitlearn,dependent-threshold-scikitlearn,dependent-threshold-tensorflow,dependent-threshold-pytorch --output-format=text:report.txt,colorized --reports=y <path_to_sources>
Or place a .pylintrc configuration file containing the above settings in the folder where you run the command, and run:
pylint <path_to_sources>
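For scripted or cross-platform runs, the same invocation can be assembled in Python instead of being typed out per shell. The following is a minimal sketch and not part of dslinter: the helper names `build_pylint_command` and `run_dslinter` are hypothetical, the check list is a shortened subset of the one above, and only the pylint options already shown in the commands above are used.

```python
import subprocess
import sys

# A shortened, illustrative subset of the checks enabled in the commands above.
CHECKS = [
    "import",
    "unnecessary-iteration-pandas",
    "nan-numpy",
    "chain-indexing-pandas",
]


def build_pylint_command(source_path, checks=CHECKS, report_file="report.txt"):
    """Return the pylint argument list mirroring the shell commands above."""
    return [
        sys.executable, "-m", "pylint",
        "--load-plugins=dslinter",
        "--disable=all",
        "--enable=" + ",".join(checks),
        f"--output-format=text:{report_file},colorized",
        "--reports=y",
        source_path,
    ]


def run_dslinter(source_path):
    """Run pylint with dslinter loaded (hypothetical helper)."""
    # pylint exits non-zero when it reports messages, so do not raise on that.
    return subprocess.run(build_pylint_command(source_path), check=False)
```

Passing the arguments as a list avoids the line-continuation differences between the Linux/Mac OS and Windows commands.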
Contributions are welcome! If you want to contribute, please follow these steps:
- Fork the repository and clone your fork.
git clone https://github.com/your-github-account/dslinter.git
git submodule update --init --recursive
- dslinter uses poetry to manage dependencies, so you will need to install poetry first and then install the dependencies.
pip install poetry
poetry install
- To install dslinter from source for development purposes, run:
poetry build
pip install ./dist/dslinter-version.tar.gz
- Assign yourself to the issue you want to solve. If you identify a new issue that needs to be solved, feel free to open a new issue.
- Make changes to the repository and run the tests. To run the tests using pytest:
poetry run pytest .
- Make a pull request. The pull request is expected to pass the tests. :)
- C5501 - C5506 | import | Import Checker: Check whether data science modules are imported using the correct naming conventions.
- R5501 | unnecessary-iteration-pandas | Unnecessary Iteration Checker(Pandas): Vectorized solutions are preferred over iterators for DataFrames. If iteration is used where a vectorized API is available, the rule is violated.
- W5501 | dataframe-iteration-modification-pandas | Unnecessary Iteration Checker(Pandas): A DataFrame that is being iterated over should not be modified. If the DataFrame is modified during iteration, the rule is violated.
- R5502 | unnecessary-iteration-tensorflow | Unnecessary Iteration Checker(TensorFlow): If there is any augmented assignment operation in a loop, the rule is violated. Augmented assignment in a loop can be replaced with a vectorized solution from the TensorFlow APIs.
- E5501 | nan-numpy | NaN Equality Checker(NumPy): Values cannot be compared with np.nan, as np.nan != np.nan.
- W5502 | chain-indexing-pandas | Chain Indexing Checker(Pandas): Chain indexing is considered bad practice in pandas code and should be avoided. If chain indexing is used on a pandas DataFrame, the rule is violated.
- R5503 | datatype-pandas | Datatype Checker(Pandas): The datatype should be set when a DataFrame is imported from data, to ensure the data is read in the expected format. If the datatype is not set when importing, the rule is violated.
- R5504 | column-selection-pandas | Column Selection Checker(Pandas): Columns should be selected explicitly after the DataFrame is imported, to make clear which columns the downstream code expects.
- R5505 | merge-parameter-pandas | Merge Parameter Checker(Pandas): The parameters 'how', 'on' and 'validate' should be set for merge operations to ensure the merge is used correctly.
- W5503 | inplace-pandas | InPlace Checker(Pandas): Operations on DataFrames return new DataFrames, which should be assigned to a variable. Otherwise the result is lost, and the rule is violated. Operations on the whitelist and operations with the inplace parameter set are excluded.
- W5504 | dataframe-conversion-pandas | Dataframe Conversion Checker(Pandas): For DataFrame conversion in pandas code, use .to_numpy() instead of .values. If .values is used in pandas code, the rule is violated.
- W5505 | scaler-missing-scikitlearn | Scaler Missing Checker(ScikitLearn): Check whether a scaler is used before every scaling-sensitive operation in scikit-learn code. Scaling-sensitive operations include Principal Component Analysis (PCA), Support Vector Machine (SVM), Stochastic Gradient Descent (SGD), Multi-layer Perceptron classifier, and L1 and L2 regularization.
- R5506 | hyperparameters-scikitlearn | Hyperparameter Checker(ScikitLearn): For scikit-learn learning algorithms, some important hyperparameters should be set.
- R5507 | hyperparameter-tensorflow | Hyperparameter Checker(TensorFlow): For neural network learning algorithms, some important hyperparameters should be set, such as learning rate, batch size, momentum and weight decay.
- R5508 | hyperparameter-pytorch | Hyperparameter Checker(PyTorch): For neural network learning algorithms, some important hyperparameters should be set, such as learning rate, batch size, momentum and weight decay.
- W5506 | memory-release-tensorflow | Memory Release Checker(TensorFlow): If a neural network is created in a loop and no memory clear operation is used, the rule is violated.
- W5507 | deterministic-pytorch | Deterministic Algorithm Usage Checker(PyTorch): If torch.use_deterministic_algorithms() is not called in a PyTorch program, the rule is violated.
- W5508 | randomness-control-numpy | Randomness Control Checker(NumPy): np.random.seed() should be used to preserve reproducibility in a machine learning program.
- W5509 | randomness-control-scikitlearn | Randomness Control Checker(ScikitLearn): For reproducible results across executions, remove any use of random_state=None in scikit-learn estimators.
- W5510 | randomness-control-tensorflow | Randomness Control Checker(TensorFlow): tf.random.set_seed() should be used to preserve reproducibility in a TensorFlow program.
- W5511 | randomness-control-pytorch | Randomness Control Checker(PyTorch): torch.manual_seed() should be used to preserve reproducibility in a PyTorch program.
- W5512 | randomness-control-dataloader-pytorch | Randomness Control Checker(PyTorch-Dataloader): worker_init_fn and generator should be set in the DataLoader to preserve reproducibility. If they are not set, the rule is violated.
- W5513 | missing-mask-tensorflow | Mask Missing Checker(TensorFlow): If a log function is used in the code, check whether the argument value is valid.
- W5514 | missing-mask-pytorch | Mask Missing Checker(PyTorch): If a log function is used in the code, check whether the argument value is valid.
- W5515 | tensor-array-tensorflow | Tensor Array Checker(TensorFlow): Use tf.TensorArray() for an array that grows inside a loop.
- W5516 | forward-pytorch | Net Forward Checker(PyTorch): It is recommended to use self.net() rather than self.net.forward() in PyTorch code. If self.net.forward() is used in the code, the rule is violated.
- W5517 | gradient-clear-pytorch | Gradient Clear Checker(PyTorch): loss_fn.backward() and optimizer.step() should be used together with optimizer.zero_grad(). If .zero_grad() is missing in the code, the rule is violated.
- W5518 | pipeline-not-used-scikitlearn | Pipeline Checker(ScikitLearn): All scikit-learn estimators should be used inside Pipelines, to prevent data leakage between training and test data.
- W5519 | dependent-threshold-scikitlearn | Dependent Threshold Checker(ScikitLearn): If a threshold-dependent evaluation metric (e.g., f-score) is used in the code, check whether a threshold-independent evaluation metric (e.g., AUC) is also used.
- W5520 | dependent-threshold-tensorflow | Dependent Threshold Checker(TensorFlow): If a threshold-dependent evaluation metric (e.g., f-score) is used in the code, check whether a threshold-independent evaluation metric (e.g., AUC) is also used.
- W5521 | dependent-threshold-pytorch | Dependent Threshold Checker(PyTorch): If a threshold-dependent evaluation metric (e.g., f-score) is used in the code, check whether a threshold-independent evaluation metric (e.g., AUC) is also used.
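A few of the rules above are easier to picture with code. The sketch below contrasts the flagged pattern with a preferred alternative for nan-numpy, chain-indexing-pandas, merge-parameter-pandas, dataframe-conversion-pandas and unnecessary-iteration-pandas. The data and variable names are made up for illustration, and the snippet says nothing about how dslinter detects these patterns internally.

```python
import numpy as np
import pandas as pd

values = np.array([1.0, np.nan, 3.0])

# nan-numpy: comparing with np.nan is always False because np.nan != np.nan.
wrong_mask = values == np.nan      # flagged pattern: never matches anything
right_mask = np.isnan(values)      # preferred: explicit NaN test

# Illustrative data only.
orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [10.0, 20.0, 5.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Bob"]})

# chain-indexing-pandas: prefer a single .loc call over orders["amount"][0].
first_amount = orders.loc[0, "amount"]

# merge-parameter-pandas: state 'how', 'on' and 'validate' explicitly.
merged = orders.merge(
    customers, how="left", on="customer_id", validate="many_to_one"
)

# dataframe-conversion-pandas: use .to_numpy() rather than .values.
amounts = merged["amount"].to_numpy()

# unnecessary-iteration-pandas: a vectorized sum instead of looping over rows.
total = merged["amount"].sum()
```

Setting validate="many_to_one" makes pandas raise an error if the customer keys turn out not to be unique, which is the kind of silent merge mistake the rule is meant to surface.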
dslinter was developed by Mark Haakman and Haiyin Zhang during our master's theses at the Software Engineering Research Group (SERG) at TU Delft and ING's AI for FinTech Research Lab, supervised by Luís Cruz and Arie van Deursen.
Maintainer: Haiyin Zhang [[email protected]].