merge dev for release 0.9.4 (#177)
* merge python3_10 integration

* improve codecov script

* improve codecov script with test verbosity

* improve codecov script with test verbosity

* add script to run tests on all python version supported

* fix path management

* change mod of executable file

* add interactive shell to handle conda

* remove shebang arg

* fix contributing and minor typo in tests script

* improve example code and remove dead example

* squash handling from url branch

* cleanup dead file

* improve speed test code

* add num_workers test for fasttext under Windows OS condition

* add tests case for num_workers test in parser

* simplified Windows test case

* update changelog

* fix Windows OS failing test due to num workers gt 0

* fix missing lowercasing of Windows OS name

* add missing download_from_url deprecated message and redirect to new refactored function

* add major release todo list to track function to remove

* update changelog

* add pragma no cover to skip codecov

* improve variable naming

* refactor position of non protected method

* bump pylint and add django for codacy

* fix deepparse tools pylint

* fix network pylint

* fix vectorizer modules

* fix torch member and parser modules

* refactor arguments init in cli and cycling import

* fix circular import

* fix last pylint errors

* fix error in csv column names versus column name

* fix list csv column names missing nargs

* remove duplicate detection and fix with statement for temporary directory

* fix pylint on test

* push to 0.8.1

* simplification skipif test testing

* bug fix issue 141

* fix missing csv dataset in test for csv integration test

* merge improvement for error handling of retrain and test API

* linting yml file

* improve run all tests script

* improve run tests python envs

* fix naming of tests and some typos

* add save_model_weights method (#147)

* bump actions version (checkout and setup-python)

* fixed actions/checkout set to 4 instead of 3

* add dependabot

* bump stale to v5

* add python 3.11 in linting

* remove python 3.11 since not supported for now and add 3.10 in windows test to see if still fails

* revert Windows Python 3.10 since it still fails

* Add codeql (#148)

* Create FUNDING.yml

* Update README.md

* Update FUNDING.yml

* Create codeql-analysis.yml

* add deprecated warnings class type on deprecated download_from_url_fn

* refactored dataset container creation into a factory

* fix errors for parsing cases

* moved arguments in dataset factory

* add tests case for new factory tool fn

* added val dataset handling

* fixed tests and remove major release todo

* added cleaning conda env

* improved script with warmup training

* remove fine_tuning script since in branch

* fixed tests

* fixed test without clear num_workers arg

* remove fn download_from_url

* removed unnecessary retrain in test api tests

* added verbose for test and improved tests for retrain test integration

* updated changelog

* fixed missing hint typing, improved internal doc, fixed train_ratio arg error in code examples and in doc

* add pylint step on code examples

* added missing typing, uniformization of assertFileExist fn, added integration test and improved doc

* remove comment in linting CI to bug-fix failing problem

* fix dead verbose retrain api flag

* add ini option for django

* remove linting of code examples since it fails due to pylint-django and I am unable to make it work

* fixed django settings

* add steps to install deepparse for code examples linting

* remove install -e

* reinstall with install -e .

* add skip=no-member since it is mostly false positives

* removed no-member pylint disable

* add docker image

* formatting

* formatted README

* update changelog

* merge uk example and fixes to doc

* hot-fix choices handling in cli.download

* linting and security template mv

* improved deepparse server error handling

* merge offline parsing

* fix typo in all test run

* fixed error in module name and refactored errors module

* fixed reference packaging other deepparse module

* added missing hint typing

* add missing urllib3 dependencies

* improve workflow

* improve doc

* add download_models, fix bug in cache path handling and fixed examples

* update changelog

* refactored test and add download_models tests

* merge refactoring of download cli fn

* moved code for licensing

* fixed typo in doc

* Update CHANGELOG.md

* added factories and tests

* added offline argument to model factory

* added data padders & tests

* black formatting

* added data padder factory & tests

* added docstring & preparing to refactor padder

* refactored data padder to solve LSP issue

* refactored vectorizer factory & temporarily removed type hinting from TrainVectorizer due to cyclic import

* adjusted docstring

* Hotfix `SSLError` when downloading model weights of model type: `bpemb` (#157)

* ✨ add `no_ssl_verification()` context manager

disables SSL for requests library within context

* 🐛 hotfix model factory for `model_type="bpemb"`

Co-authored-by: David Beauchemin <[email protected]>

* moved context wrapper in bpemb embedding model

* removed unused as err

* added pylint skip for broad except to hotfix code

* added pylint skip for broad except to hotfix code

* bump version and changelog

* added DataPadder docstring

* applied refurb (#160)

* wip - added DataProcessor and tests

* tweaked process_for_training method

* finished DataProcessor and tests

* removed obsolete tests

* added DataProcessor docstring

* Bump docker/metadata-action from 4.0.1 to 4.1.1 (#161)

Bumps [docker/metadata-action](https://github.com/docker/metadata-action) from 4.0.1 to 4.1.1.
- [Release notes](https://github.com/docker/metadata-action/releases)
- [Commits](docker/metadata-action@69f6fc9...5739616)

---
updated-dependencies:
- dependency-name: docker/metadata-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump docker/login-action from 2.0.0 to 2.1.0 (#162)

Bumps [docker/login-action](https://github.com/docker/login-action) from 2.0.0 to 2.1.0.
- [Release notes](https://github.com/docker/login-action/releases)
- [Commits](docker/login-action@49ed152...f4ef78c)

---
updated-dependencies:
- dependency-name: docker/login-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump pylint from 2.15.3 to 2.15.5 (#163)

Bumps [pylint](https://github.com/PyCQA/pylint) from 2.15.3 to 2.15.5.
- [Release notes](https://github.com/PyCQA/pylint/releases)
- [Commits](pylint-dev/pylint@v2.15.3...v2.15.5)

---
updated-dependencies:
- dependency-name: pylint
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump docker/build-push-action from 3.1.1 to 3.2.0 (#164)

Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 3.1.1 to 3.2.0.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](docker/build-push-action@c84f382...c56af95)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump black from 22.8.0 to 22.10.0 (#165)

Bumps [black](https://github.com/psf/black) from 22.8.0 to 22.10.0.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](psf/black@22.8.0...22.10.0)

---
updated-dependencies:
- dependency-name: black
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: David Beauchemin <[email protected]>

* fix black dependency in pyproject.toml

* added DataProcessorFactory and tests

* fix error in train_ratio arg example and added assert in deepparse.retrain to be more verbose

* added error handling for macOS and improved Windows handling for the num_worker and multiprocessing case

* fixed failing test and improved test for test_api

* fixed windows tests

* Update CHANGELOG.md

* Feat/add new tags to retrain cli (#167)

* add missing import in init

* add feature to allow new_prediction_tags in retrain CLI API

* bump version and changelog

* fix typo in doc retrain CLI

* fixed errors due to model naming conventions

* added final docstring

* fixed broken tests

* removed broken test patching

* cleaned-up parser after new changes integration

* black formatting

* remove accidental unused import

* fixed linting

* black formatting

* removed unnecessary args

* patching factories in AddressParser tests to memory optimise

* fixed broken tests

* removed unused import

* fixed windows tests

* fixed windows test

* removed unused modules after refactor

* removed imports for removed modules

* add tensorboard dependencies in test/requirements since missing tensorboard makes tests fail due to the Poutyne import

* Update deepparse/parser/address_parser.py

Co-authored-by: David Beauchemin <[email protected]>

* added error handling to data processor factory

* fixed linting

* Update deepparse/converter/data_processor_factory.py

* fixed broken tests

* fixed broken test

* Update CHANGELOG.md

* Bump docker/metadata-action from 4.1.1 to 4.3.0 (#173)

Bumps [docker/metadata-action](https://github.com/docker/metadata-action) from 4.1.1 to 4.3.0.
- [Release notes](https://github.com/docker/metadata-action/releases)
- [Commits](docker/metadata-action@5739616...507c2f2)

---
updated-dependencies:
- dependency-name: docker/metadata-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: David Beauchemin <[email protected]>

* Bump pylint from 2.15.9 to 2.15.10 (#174)

Bumps [pylint](https://github.com/PyCQA/pylint) from 2.15.9 to 2.15.10.
- [Release notes](https://github.com/PyCQA/pylint/releases)
- [Commits](pylint-dev/pylint@v2.15.9...v2.15.10)

---
updated-dependencies:
- dependency-name: pylint
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: David Beauchemin <[email protected]>

* Bump docker/build-push-action from 3.2.0 to 4.0.0 (#175)

Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 3.2.0 to 4.0.0.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](docker/build-push-action@c56af95...3b5e802)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: David Beauchemin <[email protected]>

* Bump black from 22.12.0 to 23.1.0 (#176)

* Bump black from 22.12.0 to 23.1.0

Bumps [black](https://github.com/psf/black) from 22.12.0 to 23.1.0.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](psf/black@22.12.0...23.1.0)

---
updated-dependencies:
- dependency-name: black
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>

* bump pyproject.toml

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: David Beauchemin <[email protected]>

* bump version

* black formatting

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Marouane Yassine <[email protected]>
Co-authored-by: Ajinkya Indulkar <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Marouane Yassine <[email protected]>
5 people authored Feb 20, 2023
1 parent e93e0ef commit 640ed90
Showing 40 changed files with 2,554 additions and 2,194 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/docker-publish.yml
@@ -28,12 +28,12 @@ jobs:

- name: Extract metadata (tags, labels) for Docker
id: meta
uses: docker/metadata-action@57396166ad8aefe6098280995947635806a0e6ea
uses: docker/metadata-action@507c2f2dc502c992ad446e3d7a5dfbe311567a96
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}

- name: Build and push Docker image
uses: docker/build-push-action@c56af957549030174b10d6867f20e78cfd7debc5
uses: docker/build-push-action@3b5e8027fcad23fda98b2e3ac259d8d67585f671
with:
context: .
push: true
5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -256,7 +256,6 @@
- Add Zenodo DOI

## 0.9
-

- Add `save_model_weights` method to `AddressParser` to save model weights (PyTorch state dictionary)
- Improve CI
@@ -288,4 +287,8 @@
- Bug-fix FastText error not handled in test API.
- Add feature to allow new_prediction_tags to retrain CLI.

## 0.9.4

- Improve codebase.

## dev
3 changes: 0 additions & 3 deletions deepparse/comparer/formatted_compared_addresses_raw.py
@@ -68,16 +68,13 @@ def _comparison_report_builder(self) -> str:
str_formatted += "Parsed address: " + repr(self.first_address) + "\n"
str_formatted += str(probs[0]) + "\n"
if not self.identical:

str_formatted += "\nParsed address: " + repr(self.second_address) + "\n"
str_formatted += str(probs[1]) + "\n"

if self.equivalent:

str_formatted += "\n\nRaw differences between the two addresses: \n"
str_formatted += self._get_raw_diff_color()
else:

str_formatted += "\n\nAddresses tags differences between the two addresses: \n"
str_formatted += self._get_tags_diff_color()

5 changes: 3 additions & 2 deletions deepparse/converter/__init__.py
@@ -1,4 +1,5 @@
# pylint: disable=wildcard-import
from .data_padding import *
from .target_converter import *
from .data_transform import *
from .data_padder import *
from .data_processor import *
from .data_processor_factory import *
204 changes: 204 additions & 0 deletions deepparse/converter/data_padder.py
@@ -0,0 +1,204 @@
from typing import List, Tuple, Union

import torch
from torch.nn.utils.rnn import pad_sequence
import numpy as np


class DataPadder:
"""
Class that handles the padding of vectorized sequences to the length of the longest sequence.
Args:
padding_value (int): the value to use as padding to extend the shorter sequences. Default: -100.
"""

def __init__(self, padding_value: int = -100) -> None:
self.padding_value = padding_value

def pad_word_embeddings_batch(
self, batch: List[Tuple[List, List]], teacher_forcing: bool = False
) -> Union[
Tuple[Tuple[torch.Tensor, torch.Tensor], torch.Tensor],
Tuple[Tuple[torch.Tensor, torch.Tensor, torch.Tensor], torch.Tensor],
]:
"""
Method to pad a batch of word embeddings sequences and their targets to the length of the longest one.
Args:
batch (List[Tuple[List, List]]): a list of tuples where the first element is a list
of word embeddings (the sequence) and the second is a list of targets.
teacher_forcing (bool): if True, the padded target vectors are returned twice,
once with the sequences and their lengths, and once on their own. This enables
the use of teacher forcing during the training of sequence to sequence models.
Return:
A tuple of two elements:
- a tuple containing either two :class:`~torch.Tensor` (the padded sequences and their
respective original lengths), or three :class:`~torch.Tensor` (the padded sequences
and their lengths, as well as the padded targets) if `teacher_forcing` is true.
For details on the padding of sequences,
check out :meth:`~DataPadder.pad_word_embeddings_sequences` below.
The returned sequences are sorted in decreasing order.
- a :class:`~torch.Tensor` containing the padded targets.
"""
sequences_vectors, target_vectors = self._extract_word_embeddings_sequences_and_target(batch)

padded_sequences, lengths = self.pad_word_embeddings_sequences(sequences_vectors)
padded_target_vectors = self.pad_targets(target_vectors)

if teacher_forcing:
return (padded_sequences, lengths, padded_target_vectors), padded_target_vectors

return (padded_sequences, lengths), padded_target_vectors

def pad_word_embeddings_sequences(self, sequences_batch: Tuple[List, ...]) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Method to pad a batch of word embeddings sequences.
Args:
sequences_batch (Tuple[List, ...]): a tuple containing lists of word embeddings (the sequences)
Return:
A tuple of two elements:
- a :class:`~torch.Tensor` containing the padded sequences.
- a :class:`~torch.Tensor` containing the respective original lengths of the padded sequences.
"""
sequences_vectors, lengths = zip(
*[
(
torch.FloatTensor(np.array(seq_vectors)),
len(seq_vectors),
)
for seq_vectors in sequences_batch
]
)

lengths = torch.tensor(lengths)

padded_sequences_vectors = self._pad_tensors(sequences_vectors)

return padded_sequences_vectors, lengths

def pad_subword_embeddings_batch(
self, batch: List[Tuple[Tuple[List, List], List]], teacher_forcing: bool = False
) -> Union[
Tuple[Tuple[torch.Tensor, List, torch.Tensor], torch.Tensor],
Tuple[Tuple[torch.Tensor, List, torch.Tensor, torch.Tensor], torch.Tensor],
]:
"""
Method to pad a batch of subword embeddings sequences and their targets to the length of the longest one.
Args:
batch (List[Tuple[Tuple[List, List], List]]): a list of tuples containing the two following elements:
- a tuple where the first element is a list of words represented as subword embeddings and the
second element is a list of the number of subword embeddings that each word is decomposed into.
- a list of targets.
teacher_forcing (bool): if True, the padded target vectors are returned twice,
once with the sequences and their lengths, and once on their own. This enables
the use of teacher forcing during the training of sequence to sequence models.
Return:
A tuple of two elements:
- A tuple (``x``, ``y``, ``z``). The element ``x`` is a :class:`~torch.Tensor` of
padded subword vectors, ``y`` is a list of padded decomposition lengths,
and ``z`` is a :class:`~torch.Tensor` of the original lengths of the sequences
before padding. If teacher_forcing is True, a fourth element is added which
corresponds to a :class:`~torch.Tensor` of the padded targets. For details
on the padding of sequences, check out :meth:`~DataPadder.pad_subword_embeddings_sequences` below.
The returned sequences are sorted in decreasing order.
- a :class:`~torch.Tensor` containing the padded targets.
"""
sequences_tuples, target_vectors = self._extract_subword_embeddings_sequences_and_targets(batch)

padded_sequences, decomposition_lengths, sequence_lengths = self.pad_subword_embeddings_sequences(
sequences_tuples
)
padded_target_vectors = self.pad_targets(target_vectors)

if teacher_forcing:
return (
padded_sequences,
decomposition_lengths,
sequence_lengths,
padded_target_vectors,
), padded_target_vectors

return (padded_sequences, decomposition_lengths, sequence_lengths), padded_target_vectors

def pad_subword_embeddings_sequences(
self, sequences_batch: Tuple[Tuple[List, List], ...]
) -> Tuple[torch.Tensor, List, torch.Tensor]:
"""
Method to pad a batch of subword embeddings sequences.
Args:
sequences_batch (Tuple[Tuple[List, List], ...]): a tuple containing tuples of two elements:
- a list of lists representing words as lists of subword embeddings.
- a list of the number of subword embeddings that each word is decomposed into.
Return:
A tuple of three elements:
- a :class:`~torch.Tensor` containing the padded sequences.
- a list containing the padded decomposition lengths of each word. When a word is
added as padding to elongate a sequence, we consider that the decomposition
length of the added word is 1.
- a :class:`~torch.Tensor` containing the respective original lengths (number of words)
of the padded sequences.
"""
sequences_vectors, decomp_len, lengths = zip(
*[
(
torch.tensor(np.array(vectors)),
word_decomposition_len,
len(vectors),
)
for vectors, word_decomposition_len in sequences_batch
]
)

padded_sequences_vectors = self._pad_tensors(sequences_vectors)

lengths = torch.tensor(lengths)
max_sequence_length = lengths.max().item()
for decomposition_length in decomp_len:
if len(decomposition_length) < max_sequence_length:
decomposition_length.extend([1] * (max_sequence_length - len(decomposition_length)))

return padded_sequences_vectors, list(decomp_len), lengths

def pad_targets(self, target_batch: Tuple[List, ...]) -> torch.Tensor:
"""
Method to pad a batch of target indices to the length of the longest one.
Args:
target_batch (Tuple[List, ...]): a tuple containing lists of target indices.
Return:
A :class:`~torch.Tensor` of padded targets.
"""
target_batch = map(torch.tensor, target_batch)

return self._pad_tensors(target_batch)

def _extract_word_embeddings_sequences_and_target(self, batch: List[Tuple[List, List]]) -> Tuple[List, List]:
"""
Method that takes a list of word embedding sequences and targets and zips the
sequences together and the targets together.
"""
sorted_batch = sorted(batch, key=lambda x: len(x[0]), reverse=True)

sequence_batch, target_batch = zip(*sorted_batch)

return sequence_batch, target_batch

def _extract_subword_embeddings_sequences_and_targets(
self, batch: List[Tuple[Tuple[List, List], List]]
) -> Tuple[List[Tuple[List, List]], List]:
"""
Method that takes a list of subword embedding sequences and targets
and zips the sequences together and the targets together.
"""
sorted_batch = sorted(batch, key=lambda x: len(x[0][1]), reverse=True)

sequence_batch, target_batch = zip(*sorted_batch)

return sequence_batch, target_batch

def _pad_tensors(self, sequences_batch: Tuple[torch.Tensor, ...]) -> torch.Tensor:
"""
A method to pad and collate multiple :class:`torch.Tensor` representing sequences
into a single :class:`torch.Tensor` using :attr:`DataPadder.padding_value`.
The final :class:`torch.Tensor` is returned with batch first.
"""

return pad_sequence(sequences_batch, batch_first=True, padding_value=self.padding_value)
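The two padding mechanics this new `DataPadder` relies on can be sketched in isolation. This is not part of the commit, just a minimal illustration with toy values: `torch.nn.utils.rnn.pad_sequence` (the core of `_pad_tensors`) and the decomposition-length extension loop from `pad_subword_embeddings_sequences`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two toy "word embedding" sequences of unequal length (embedding dim 3),
# standing in for the vectorized addresses DataPadder receives.
seq_a = torch.ones(4, 3)  # 4 words
seq_b = torch.ones(2, 3)  # 2 words

# Same call as DataPadder._pad_tensors: batch-first, sentinel padding value.
padded = pad_sequence([seq_a, seq_b], batch_first=True, padding_value=-100)

print(tuple(padded.shape))  # (2, 4, 3)
print(padded[1, 2, 0])      # tensor(-100.) -> the shorter sequence is right-padded

# Decomposition-length padding from pad_subword_embeddings_sequences:
# a word added as padding is treated as decomposing into a single subword.
decomposition_lengths = [[2, 1, 3], [1, 2]]
max_sequence_length = 3
for d in decomposition_lengths:
    if len(d) < max_sequence_length:
        d.extend([1] * (max_sequence_length - len(d)))

print(decomposition_lengths)  # [[2, 1, 3], [1, 2, 1]]
```

Note that `pad_sequence` pads every sequence to the length of the longest one in the batch, which is why the batch is sorted by length beforehand in the extraction helpers.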