merge dev for release 0.9.4 (#177)
* merge python3_10 integration

* improve codecov script

* improve codecov script with test verbosity

* improve codecov script with test verbosity

* add script to run tests on all python version supported

* fix path management

* change mod of executable file

* add interactive shell to handle conda

* remove shebang arg

* fix contributing and minor typo in tests script

* improve example code and remove dead example

* squash handling from url branch

* cleanup dead file

* improve speed test code

* add num_workers test for fasttext under Windows OS condition

* add tests case for num_workers test in parser

* simplified Windows test case

* update changelog

* fix Windows OS failing test due to num workers gt 0

* fix missing lowercasing of Windows OS name

* add missing download_from_url deprecated message and redirect to new refactored function

* add major release todo list to track function to remove

* update changelog

* add pragma no cover to skip codecov

* improve variable naming

* refactor position of non protected method

* bump pylint and add django for codacy

* fix deepparse tools pylint

* fix network pylint

* fix vectorizer modules

* fix torch member and parser modules

* refactor arguments init in cli and cycling import

* fix circular import

* fix last pylint errors

* fix error in csv column names versus column name

* fix list csv column names missing nargs

* remove duplicate detection and fix with statement for temporary directory

* fix pylint on test

* push to 0.8.1

* simplification skipif test testing

* bug fix issue 141

* fix missing csv dataset in test for csv integration test

* merge improvement for error handling of retrain and test API

* linting yml file

* improve run all tests script

* improve run tests python envs

* fix naming of tests and some typos

* add save_model_weights method (#147)

* bump actions version (checkout and setup-python)

* fixed actions/checkout set to 4 instead of 3

* add dependabot

* bump stale to v5

* add python 3.11 in linting

* remove python 3.11 since not supported for now and add 3.10 in windows test to see if still fails

* revert Windows Python 3.10 since it still fails

* Add codeql (#148)

* Create FUNDING.yml

* Update README.md

* Update FUNDING.yml

* Create codeql-analysis.yml

* add deprecated warnings class type on deprecated download_from_url_fn

* refactored dataset container creation into a factory

* fix errors for parsing cases

* moved arguments in dataset factory

* add tests case for new factory tool fn

* added val dataset handling

* fixed tests and remove major release todo

* added cleaning conda env

* improved script with warmup training

* remove fine_tuning script since in branch

* fixed tests

* fixed test without clear num_workers arg

* remove fn download_from_url

* removed unnecessary retrain in test api tests

* added verbose for test and improved tests for retrain test integration

* updated changelog

* fixed missing hint typing, improved internal doc, fixed train_ratio arg error in code examples and in doc

* add pylint step on code examples

* added missing typing, uniformization of assertFileExist fn, added integration test and improved doc

* remove comment in linting CI to bug-fix failing problem

* fix dead verbose retrain api flag

* add ini option for django

* remove linting of code examples since it fails due to pylint-django and I am unable to make it work

* fixed django settings

* add steps to install deepparse for code examples linting

* remove install -e

* reinstall with install -e .

* add skip=no-member since it is mostly false positives

* removed no-member pylint disable

* add docker image

* formatting

* formatted README

* update changelog

* merge uk example and fixes to doc

* hot-fix choices handling in cli.download

* linting and security template mv

* improved deepparse server error handling

* merge offline parsing

* fix typo in all test run

* fixed error in module name and refactored errors module

* fixed reference packaging other deepparse module

* added missing hint typing

* add missing urllib3 dependencies

* improve workflow

* improve doc

* add download_models, fix bug in cache path handling and fixed examples

* update changelog

* refactored test and add download_models tests

* merge refactoring of download cli fn

* moved code for licensing

* fixed typo in doc

* Update CHANGELOG.md

* added factories and tests

* added offline argument to model factory

* added data padders & tests

* black formatting

* added data padder factory & tests

* added docstring & preparing to refactor padder

* refactored data padder to solve LSP issue

* refactored vectorizer factory & temporarily removed type hinting from TrainVectorizer due to cyclic import

* adjusted docstring

* Hotfix `SSLError` when downloading model weights of model type: `bpemb` (#157)

* ✨ add `no_ssl_verification()` context manager

disables SSL for requests library within context

* 🐛 hotfix model factory for `model_type="bpemb"`

Co-authored-by: David Beauchemin <[email protected]>

* moved context wrapper in bpemb embedding model

* removed unused as err

* added pylint skip for broad except to hotfix code

* added pylint skip for broad except to hotfix code

* bump version and changelog

* added DataPadder docstring

* applied refurb (#160)

* wip - added DataProcessor and tests

* tweaked process_for_training method

* finished DataProcessor and tests

* removed obsolete tests

* added DataProcessor docstring

* Bump docker/metadata-action from 4.0.1 to 4.1.1 (#161)

Bumps [docker/metadata-action](https://github.com/docker/metadata-action) from 4.0.1 to 4.1.1.
- [Release notes](https://github.com/docker/metadata-action/releases)
- [Commits](docker/metadata-action@69f6fc9...5739616)

---
updated-dependencies:
- dependency-name: docker/metadata-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump docker/login-action from 2.0.0 to 2.1.0 (#162)

Bumps [docker/login-action](https://github.com/docker/login-action) from 2.0.0 to 2.1.0.
- [Release notes](https://github.com/docker/login-action/releases)
- [Commits](docker/login-action@49ed152...f4ef78c)

---
updated-dependencies:
- dependency-name: docker/login-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump pylint from 2.15.3 to 2.15.5 (#163)

Bumps [pylint](https://github.com/PyCQA/pylint) from 2.15.3 to 2.15.5.
- [Release notes](https://github.com/PyCQA/pylint/releases)
- [Commits](pylint-dev/pylint@v2.15.3...v2.15.5)

---
updated-dependencies:
- dependency-name: pylint
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump docker/build-push-action from 3.1.1 to 3.2.0 (#164)

Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 3.1.1 to 3.2.0.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](docker/build-push-action@c84f382...c56af95)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump black from 22.8.0 to 22.10.0 (#165)

Bumps [black](https://github.com/psf/black) from 22.8.0 to 22.10.0.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](psf/black@22.8.0...22.10.0)

---
updated-dependencies:
- dependency-name: black
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: David Beauchemin <[email protected]>

* fix black dependency in pyproject.toml

* added DataProcessorFactory and tests

* fix error in train_ratio arg example and added assert in deepparse.retrain to be more verbose

* added error handling for macOS and improved Windows handling for the num_worker and multiprocessing case

* fixed failing test and improved test for test_api

* fixed windows tests

* Update CHANGELOG.md

* Feat/add new tags to retrain cli (#167)

* add missing import in init

* add feature to allow new_prediction_tags in retrain CLI API

* bump version and changelog

* fix typo in doc retrain CLI

* fixed errors due to model naming conventions

* added final docstring

* fixed broken tests

* removed broken test patching

* cleaned-up parser after new changes integration

* black formatting

* remove accidental unused import

* fixed linting

* black formatting

* removed unnecessary args

* patching factories in AddressParser tests to memory optimise

* fixed broken tests

* removed unused import

* fixed windows tests

* fixed windows test

* removed unused modules after refactor

* removed imports for removed modules

* add tensorboard dependencies in test/requirements since missing tensorboard makes tests fail due to the Poutyne import

* Update deepparse/parser/address_parser.py

Co-authored-by: David Beauchemin <[email protected]>

* added error handling to data processor factory

* fixed linting

* Update deepparse/converter/data_processor_factory.py

* fixed broken tests

* fixed broken test

* Update CHANGELOG.md

* Bump docker/metadata-action from 4.1.1 to 4.3.0 (#173)

Bumps [docker/metadata-action](https://github.com/docker/metadata-action) from 4.1.1 to 4.3.0.
- [Release notes](https://github.com/docker/metadata-action/releases)
- [Commits](docker/metadata-action@5739616...507c2f2)

---
updated-dependencies:
- dependency-name: docker/metadata-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: David Beauchemin <[email protected]>

* Bump pylint from 2.15.9 to 2.15.10 (#174)

Bumps [pylint](https://github.com/PyCQA/pylint) from 2.15.9 to 2.15.10.
- [Release notes](https://github.com/PyCQA/pylint/releases)
- [Commits](pylint-dev/pylint@v2.15.9...v2.15.10)

---
updated-dependencies:
- dependency-name: pylint
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: David Beauchemin <[email protected]>

* Bump docker/build-push-action from 3.2.0 to 4.0.0 (#175)

Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 3.2.0 to 4.0.0.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](docker/build-push-action@c56af95...3b5e802)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: David Beauchemin <[email protected]>

* Bump black from 22.12.0 to 23.1.0 (#176)

* Bump black from 22.12.0 to 23.1.0

Bumps [black](https://github.com/psf/black) from 22.12.0 to 23.1.0.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](psf/black@22.12.0...23.1.0)

---
updated-dependencies:
- dependency-name: black
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>

* bump pyproject.toml

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: David Beauchemin <[email protected]>

* bump version

* black formatting

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Marouane Yassine <[email protected]>
Co-authored-by: Ajinkya Indulkar <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Marouane Yassine <[email protected]>
5 people authored Feb 20, 2023
1 parent e93e0ef commit 640ed90
Showing 40 changed files with 2,554 additions and 2,194 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/docker-publish.yml
@@ -28,12 +28,12 @@ jobs:

- name: Extract metadata (tags, labels) for Docker
id: meta
uses: docker/metadata-action@57396166ad8aefe6098280995947635806a0e6ea
uses: docker/metadata-action@507c2f2dc502c992ad446e3d7a5dfbe311567a96
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}

- name: Build and push Docker image
uses: docker/build-push-action@c56af957549030174b10d6867f20e78cfd7debc5
uses: docker/build-push-action@3b5e8027fcad23fda98b2e3ac259d8d67585f671
with:
context: .
push: true
5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -256,7 +256,6 @@
- Add Zenodo DOI

## 0.9
-

- Add `save_model_weights` method to `AddressParser` to save model weights (PyTorch state dictionary)
- Improve CI
@@ -288,4 +287,8 @@
- Bug-fix FastText error not handled in test API.
- Add feature to allow new_prediction_tags to retrain CLI.

## 0.9.4

- Improve codebase.

## dev
3 changes: 0 additions & 3 deletions deepparse/comparer/formatted_compared_addresses_raw.py
@@ -68,16 +68,13 @@ def _comparison_report_builder(self) -> str:
str_formatted += "Parsed address: " + repr(self.first_address) + "\n"
str_formatted += str(probs[0]) + "\n"
if not self.identical:

str_formatted += "\nParsed address: " + repr(self.second_address) + "\n"
str_formatted += str(probs[1]) + "\n"

if self.equivalent:

str_formatted += "\n\nRaw differences between the two addresses: \n"
str_formatted += self._get_raw_diff_color()
else:

str_formatted += "\n\nAddresses tags differences between the two addresses: \n"
str_formatted += self._get_tags_diff_color()

5 changes: 3 additions & 2 deletions deepparse/converter/__init__.py
@@ -1,4 +1,5 @@
# pylint: disable=wildcard-import
from .data_padding import *
from .target_converter import *
from .data_transform import *
from .data_padder import *
from .data_processor import *
from .data_processor_factory import *
204 changes: 204 additions & 0 deletions deepparse/converter/data_padder.py
@@ -0,0 +1,204 @@
from typing import List, Tuple, Union

import torch
from torch.nn.utils.rnn import pad_sequence
import numpy as np


class DataPadder:
"""
Class that handles the padding of vectorized sequences to the length of the longest sequence.
Args:
padding_value (int): the value to use as padding to extend the shorter sequences. Default: -100.
"""

def __init__(self, padding_value: int = -100) -> None:
self.padding_value = padding_value

def pad_word_embeddings_batch(
self, batch: List[Tuple[List, List]], teacher_forcing: bool = False
) -> Union[
Tuple[Tuple[torch.Tensor, torch.Tensor], torch.Tensor],
Tuple[Tuple[torch.Tensor, torch.Tensor, torch.Tensor], torch.Tensor],
]:
"""
Method to pad a batch of word embeddings sequences and their targets to the length of the longest one.
Args:
batch (List[Tuple[List, List]]): a list of tuples where the first element is a list
of word embeddings (the sequence) and the second is a list of targets.
teacher_forcing (bool): if True, the padded target vectors are returned twice,
once with the sequences and their lengths, and once on their own. This enables
the use of teacher forcing during the training of sequence to sequence models.
Return:
A tuple of two elements:
- a tuple containing either two :class:`~torch.Tensor` (the padded sequences and their
respective original lengths), or three :class:`~torch.Tensor` (the padded sequences
and their lengths, as well as the padded targets) if `teacher_forcing` is true.
For details on the padding of sequences,
check out :meth:`~DataPadder.pad_word_embeddings_sequences` below.
The returned sequences are sorted in decreasing order.
- a :class:`~torch.Tensor` containing the padded targets.
"""
sequences_vectors, target_vectors = self._extract_word_embeddings_sequences_and_target(batch)

padded_sequences, lengths = self.pad_word_embeddings_sequences(sequences_vectors)
padded_target_vectors = self.pad_targets(target_vectors)

if teacher_forcing:
return (padded_sequences, lengths, padded_target_vectors), padded_target_vectors

return (padded_sequences, lengths), padded_target_vectors

def pad_word_embeddings_sequences(self, sequences_batch: Tuple[List, ...]) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Method to pad a batch of word embeddings sequences.
Args:
sequences_batch (Tuple[List, ...]): a tuple containing lists of word embeddings (the sequences)
Return:
A tuple of two elements:
- a :class:`~torch.Tensor` containing the padded sequences.
- a :class:`~torch.Tensor` containing the respective original lengths of the padded sequences.
"""
sequences_vectors, lengths = zip(
*[
(
torch.FloatTensor(np.array(seq_vectors)),
len(seq_vectors),
)
for seq_vectors in sequences_batch
]
)

lengths = torch.tensor(lengths)

padded_sequences_vectors = self._pad_tensors(sequences_vectors)

return padded_sequences_vectors, lengths

def pad_subword_embeddings_batch(
self, batch: List[Tuple[Tuple[List, List], List]], teacher_forcing: bool = False
) -> Union[
Tuple[Tuple[torch.Tensor, List, torch.Tensor], torch.Tensor],
Tuple[Tuple[torch.Tensor, List, torch.Tensor, torch.Tensor], torch.Tensor],
]:
"""
Method to pad a batch of subword embeddings sequences and their targets to the length of the longest one.
Args:
batch (List[Tuple[Tuple[List, List], List]]): a list of tuples containing the two following elements:
- a tuple where the first element is a list of words represented as subword embeddings and the
second element is a list of the number of subword embeddings that each word is decomposed into.
- a list of targets.
teacher_forcing (bool): if True, the padded target vectors are returned twice,
once with the sequences and their lengths, and once on their own. This enables
the use of teacher forcing during the training of sequence to sequence models.
Return:
A tuple of two elements:
- A tuple (``x``, ``y``, ``z``). The element ``x`` is a :class:`~torch.Tensor` of
padded subword vectors, ``y`` is a list of padded decomposition lengths,
and ``z`` is a :class:`~torch.Tensor` of the original lengths of the sequences
before padding. If teacher_forcing is True, a fourth element is added which
corresponds to a :class:`~torch.Tensor` of the padded targets. For details
on the padding of sequences, check out :meth:`~DataPadder.pad_subword_embeddings_sequences` below.
The returned sequences are sorted in decreasing order.
- a :class:`~torch.Tensor` containing the padded targets.
"""
sequences_tuples, target_vectors = self._extract_subword_embeddings_sequences_and_targets(batch)

padded_sequences, decomposition_lengths, sequence_lengths = self.pad_subword_embeddings_sequences(
sequences_tuples
)
padded_target_vectors = self.pad_targets(target_vectors)

if teacher_forcing:
return (
padded_sequences,
decomposition_lengths,
sequence_lengths,
padded_target_vectors,
), padded_target_vectors

return (padded_sequences, decomposition_lengths, sequence_lengths), padded_target_vectors

def pad_subword_embeddings_sequences(
self, sequences_batch: Tuple[Tuple[List, List], ...]
) -> Tuple[torch.Tensor, List, torch.Tensor]:
"""
Method to pad a batch of subword embeddings sequences.
Args:
sequences_batch (Tuple[Tuple[List, List], ...]): a tuple containing tuples of two elements:
- a list of lists representing words as lists of subword embeddings.
- a list of the number of subword embeddings that each word is decomposed into.
Return:
A tuple of three elements:
- a :class:`~torch.Tensor` containing the padded sequences.
- a list containing the padded decomposition lengths of each word. When a word is
added as padding to elongate a sequence, we consider that the decomposition
length of the added word is 1.
- a :class:`~torch.Tensor` containing the respective original lengths (number of words)
of the padded sequences.
"""
sequences_vectors, decomp_len, lengths = zip(
*[
(
torch.tensor(np.array(vectors)),
word_decomposition_len,
len(vectors),
)
for vectors, word_decomposition_len in sequences_batch
]
)

padded_sequences_vectors = self._pad_tensors(sequences_vectors)

lengths = torch.tensor(lengths)
max_sequence_length = lengths.max().item()
for decomposition_length in decomp_len:
if len(decomposition_length) < max_sequence_length:
decomposition_length.extend([1] * (max_sequence_length - len(decomposition_length)))

return padded_sequences_vectors, list(decomp_len), lengths

def pad_targets(self, target_batch: Tuple[List, ...]) -> torch.Tensor:
"""
Method to pad a batch of target indices to the length of the longest one.
Args:
target_batch (Tuple[List, ...]): a tuple containing lists of target indices.
Return:
A :class:`~torch.Tensor` of padded targets.
"""
target_batch = map(torch.tensor, target_batch)

return self._pad_tensors(target_batch)

def _extract_word_embeddings_sequences_and_target(self, batch: List[Tuple[List, List]]) -> Tuple[List, List]:
"""
Method that takes a list of word embedding sequences and targets and zips the
sequences together and the targets together.
"""
sorted_batch = sorted(batch, key=lambda x: len(x[0]), reverse=True)

sequence_batch, target_batch = zip(*sorted_batch)

return sequence_batch, target_batch

def _extract_subword_embeddings_sequences_and_targets(
self, batch: List[Tuple[Tuple[List, List], List]]
) -> Tuple[List[Tuple[List, List]], List]:
"""
Method that takes a list of subword embedding sequences and targets
and zips the sequences together and the targets together.
"""
sorted_batch = sorted(batch, key=lambda x: len(x[0][1]), reverse=True)

sequence_batch, target_batch = zip(*sorted_batch)

return sequence_batch, target_batch

def _pad_tensors(self, sequences_batch: Tuple[torch.Tensor, ...]) -> torch.Tensor:
"""
A method to pad and collate multiple :class:`torch.Tensor` representing sequences
into a single :class:`torch.Tensor` using :attr:`DataPadder.padding_value`.
The final :class:`torch.Tensor` is returned with batch first.
"""

return pad_sequence(sequences_batch, batch_first=True, padding_value=self.padding_value)
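The two padding mechanics this new `DataPadder` relies on can be sketched in isolation. This is not part of the commit, just a minimal illustration with toy values: `torch.nn.utils.rnn.pad_sequence` (the core of `_pad_tensors`) and the decomposition-length extension loop from `pad_subword_embeddings_sequences`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two toy "word embedding" sequences of unequal length (embedding dim 3),
# standing in for the vectorized addresses DataPadder receives.
seq_a = torch.ones(4, 3)  # 4 words
seq_b = torch.ones(2, 3)  # 2 words

# Same call as DataPadder._pad_tensors: batch-first, sentinel padding value.
padded = pad_sequence([seq_a, seq_b], batch_first=True, padding_value=-100)

print(tuple(padded.shape))  # (2, 4, 3)
print(padded[1, 2, 0])      # tensor(-100.) -> the shorter sequence is right-padded

# Decomposition-length padding from pad_subword_embeddings_sequences:
# a word added as padding is treated as decomposing into a single subword.
decomposition_lengths = [[2, 1, 3], [1, 2]]
max_sequence_length = 3
for d in decomposition_lengths:
    if len(d) < max_sequence_length:
        d.extend([1] * (max_sequence_length - len(d)))

print(decomposition_lengths)  # [[2, 1, 3], [1, 2, 1]]
```

Note that `pad_sequence` pads every sequence to the length of the longest one in the batch, which is why the batch is sorted by length beforehand in the extraction helpers.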