UHF-8727: Added transliterating for multiple languages. #571

Merged Sep 12, 2023 (5 commits)

Conversation

dire (Contributor) commented Aug 28, 2023

UHF-8727

In some languages (e.g. Arabic, Ukrainian), the heading IDs end up consisting only of dashes (------------------).

This does not fix Chinese, as that is a bit more complicated.
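
For context, here is a minimal sketch of why the IDs collapse. It assumes the ID is built by replacing every non-word character with a dash, as the table-of-contents script in this PR does. Without the `u` or `v` flag, `\W` in a JavaScript regex matches everything outside `[A-Za-z0-9_]`, so each Cyrillic or Arabic letter turns into a dash. `naiveSlug` is a hypothetical name used only for illustration.

```javascript
// Hypothetical reduction of the ID step: lowercase, trim, then dash out non-word characters.
function naiveSlug(text) {
  return text.toLowerCase().trim().replace(/\W/g, '-');
}

// ASCII headings survive the replacement.
console.log(naiveSlug('Hello world')); // "hello-world"

// Every Cyrillic letter is a non-word character for \W, so only dashes remain.
console.log(naiveSlug('Привіт світ'));
```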

What was done

  • Added transliteration logic for other languages to avoid these IDs.

How to install

  • Make sure your etusivu instance is up and running on the latest dev branch.
    • git pull origin dev
    • make fresh
  • Update the Helfi Platform config
    • composer require drupal/helfi_platform_config:dev-UHF-8727_automatic-id-tweaks
  • Run make drush-cr

How to test

  • Create Arabic and Ukrainian pages with text in those languages. Add some headings to the content.
  • Check that the IDs of those headings use Latin characters and are not just dashes.
  • Can you think of any edge cases? Please share them.
  • Check that the code follows our standards.
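
The ID check above could be done from the browser console with something like the sketch below. Both helpers are hypothetical and not part of this PR, and the selector for the content root is an assumption that may need adjusting for a given page.

```javascript
// Hypothetical helpers for manual testing; they are not part of this PR.

// True when an id contains no ASCII letter or digit, e.g. only dashes or underscores.
function isDegenerateId(id) {
  return !/[a-z0-9]/i.test(id);
}

// Collect the ids of h2-h6 headings under `root` that are degenerate.
function findDegenerateHeadingIds(root) {
  const headings = root.querySelectorAll('h2[id], h3[id], h4[id], h5[id], h6[id]');
  return Array.from(headings)
    .map((heading) => heading.id)
    .filter(isDegenerateId);
}
```

Running `findDegenerateHeadingIds(document.querySelector('main'))` on an Arabic or Ukrainian test page should return an empty array if the transliteration works for every heading.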

codecov-commenter commented Aug 28, 2023

Codecov Report

Merging #571 (05de235) into main (f950e0a) will not change coverage.
Report is 45 commits behind head on main.
The diff coverage is n/a.

❗ Current head 05de235 differs from pull request most recent head 185daf8. Consider uploading reports for the commit 185daf8 to get more accurate results

@@            Coverage Diff            @@
##               main     #571   +/-   ##
=========================================
  Coverage     12.74%   12.74%           
  Complexity      236      236           
=========================================
  Files            30       30           
  Lines           902      902           
=========================================
  Hits            115      115           
  Misses          787      787           

Arkkimaagi (Contributor) left a comment

Works well, but I'd simplify the swaps a bit to reduce loops and code size.

  1. Since we lowercase content.textContent before matching, there is no need to match uppercase characters. We can fold all matching into the lowercase entries.
  2. We do not need a separate regex query for each character; we can combine the characters into a single query and reduce the loops considerably. Instead of checking each character that becomes b separately with 6 regex matches, 'b': ['б', 'β', 'ب', 'ဗ', 'ბ', 'b'] (and another 4 for uppercase B, 'B': ['Б', 'Β', 'ब', 'B']), we can check for any character that becomes b with a single regex, 'b': '[бβبဗბbब]', which holds the lowercase matches of b and B combined.
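
The two points can be sketched in isolation as a one-off conversion from the per-character arrays to one character class per target. `toCharacterClasses` and `perCharacter` are hypothetical names, and the sketch assumes the source characters are plain letters, so nothing needs escaping inside the brackets.

```javascript
// Hypothetical input in the old shape: one array of source characters per target letter.
const perCharacter = {
  b: ['б', 'β', 'b'],
  B: ['Б', 'Β', 'B'],
};

// Merge upper- and lowercase targets and emit one regex character class per target.
// Assumes no source character is special inside [...] (no ], \, ^ or -).
function toCharacterClasses(map) {
  const merged = {};
  Object.entries(map).forEach(([target, chars]) => {
    const key = target.toLowerCase();
    merged[key] = merged[key] || new Set();
    chars.forEach((ch) => merged[key].add(ch.toLowerCase()));
  });
  const classes = {};
  Object.keys(merged).forEach((key) => {
    classes[key] = '[' + Array.from(merged[key]).join('') + ']';
  });
  return classes;
}

const classes = toCharacterClasses(perCharacter);

// One replace call now covers every source character for the target.
console.log('бβb'.replace(new RegExp(classes.b, 'g'), 'b'));
```

Because the heading text is lowercased first, the uppercase entries add nothing once their lowercase forms are in the class, which is why the set ends up with only three members here.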

Here's code that does what I described above; please consider using it:

'use strict';

(function (Drupal, once, drupalSettings) {
  Drupal.behaviors.table_of_contents = {
    attach: function attach() {

      function findAvailableId(name, reserved, anchors, count) {
        let newName = name;
        if (count > 0) { // Add a counter only when headings are not unique on the page.
          newName += '-' + count;
        }
        if (reserved.includes(newName)) {
          return findAvailableId(name, reserved, anchors, ++count);
        } else if (anchors.includes(newName)) {
          if (count === 0) {
            count++; // When the reserved heading is visible on the page, start counting from 2 instead of 1.
          }
          return findAvailableId(name, reserved, anchors, ++count);
        }
        return newName;
      }

      const anchors = [];
      const tableOfContents = document.getElementById('helfi-toc-table-of-contents');
      const tableOfContentsList = document.querySelector('#helfi-toc-table-of-contents-list > ul');
      const mainContent = document.querySelector('main.layout-main-wrapper');
      const reservedElems = document.querySelectorAll('[id]');
      const reserved = []; // List the current IDs here to avoid creating duplicates.
      reservedElems.forEach(function (elem) {
        reserved.push(elem.id);
      });

      // Exclude elements from TOC that are not content:
      // e.g. TOC, sidebar, cookie compliance banner etc.
      const exclusions = '' +
        ':not(.layout-sidebar-first *)' +
        ':not(.layout-sidebar-second *)' +
        ':not(.tools__container *)' +
        ':not(.breadcrumb__container *)' +
        ':not(#helfi-toc-table-of-contents *)' +
        ':not(.embedded-content-cookie-compliance *)' +
        ':not(.react-and-share-cookie-compliance *)';

      const titleComponents = [
        'h2'+exclusions,
        'h3'+exclusions,
        'h4'+exclusions,
        'h5'+exclusions,
        'h6'+exclusions,
      ];

      const mainLanguages = [
        'en',
        'fi',
        'sv',
      ];

      const swaps = {
        '0': '[°₀۰0]',
        '1': '[¹₁۱1]',
        '2': '[²₂۲2]',
        '3': '[³₃۳3]',
        '4': '[⁴₄۴٤4]',
        '5': '[⁵₅۵٥5]',
        '6': '[⁶₆۶٦6]',
        '7': '[⁷₇۷7]',
        '8': '[⁸₈۸8]',
        '9': '[⁹₉۹9]',
        'a': '[àáảãạăắằẳẵặâấầẩẫậāąåαάἀἁἂἃἄἅἆἇᾀᾁᾂᾃᾄᾅᾆᾇὰᾰᾱᾲᾳᾴᾶᾷаأအာါǻǎªაअاaä]',
        'b': '[бβبဗბbब]',
        'c': '[çćčĉċc©]',
        'd': '[ďðđƌȡɖɗᵭᶁᶑдδدضဍဒდdᴅᴆ]',
        'e': '[éèẻẽẹêếềểễệëēęěĕėεέἐἑἒἓἔἕὲеёэєəဧေဲეएإئe]',
        'f': '[фφفƒფf]',
        'g': '[ĝğġģгґγဂგگg]',
        'h': '[ĥħηήحهဟှჰh]',
        'i': '[íìỉĩịîïīĭįıιίϊΐἰἱἲἳἴἵἶἷὶῐῑῒῖῗіїиဣိီည်ǐიइیii̇ϒ]',
        'j': '[ĵјჯجj]',
        'k': '[ķĸкκقكကკქکk]',
        'l': '[łľĺļŀлλلလლlल]',
        'm': '[мμمမმm]',
        'n': '[ñńňņʼnŋνнنနნn]',
        'o': '[óòỏõọôốồổỗộơớờởỡợøōőŏοὀὁὂὃὄὅὸόоوθိုǒǿºოओoöө]',
        'p': '[пπပპپp]',
        'q': '[ყq]',
        'r': '[ŕřŗрρرრr]',
        's': '[śšşсσșςسصစſსsŝ]',
        't': '[ťţтτțتطဋတŧთტt]',
        'u': '[úùủũụưứừửữựûūůűŭųµуဉုူǔǖǘǚǜუउuўü]',
        'v': '[вვϐv]',
        'w': '[ŵωώဝွw]',
        'x': '[χξx]',
        'y': '[ýỳỷỹỵÿŷйыυϋύΰيယyῠῡὺ]',
        'z': '[źžżзζزဇზz]',
        'aa': '[عआآ]',
        'ae': '[æǽ]',
        'ai': '[ऐ]',
        'ch': '[чჩჭچ]',
        'dj': '[ђđ]',
        'dz': '[џძ]',
        'ei': '[ऍ]',
        'gh': '[غღ]',
        'ii': '[ई]',
        'ij': '[ĳ]', // The single ligature character; a plain [ij] class would rewrite every i and j as "ij".
        'kh': '[хخხ]',
        'lj': '[љ]',
        'nj': '[њ]',
        'oe': '[öœؤ]',
        'oi': '[ऑ]',
        'oii': '[ऒ]',
        'ps': '[ψ]',
        'sh': '[шშش]',
        'shch': '[щ]',
        'ss': '[ß]',
        'sx': '[ŝ]',
        'th': '[þϑثذظ]',
        'ts': '[цცწ]',
        'ue': '[ü]',
        'uu': '[ऊ]',
        'ya': '[я]',
        'yu': '[ю]',
        'zh': '[жჟژ]',
        'gx': '[ĝ]',
        'hx': '[ĥ]',
        'jx': '[ĵ]',
      };

      // Craft table of contents.
      once('table-of-contents', titleComponents.join(','), mainContent)
        .forEach(function (content) {
          let name = content.textContent
            .toLowerCase()
            .trim();

          // To ensure backwards compatibility, this is done only to "other" languages.
          if (!mainLanguages.includes(drupalSettings.path.currentLanguage)) {
            Object.keys(swaps).forEach((swap) => {
              name = name.replace(new RegExp(swaps[swap], 'g'), swap);
            });
          }
          else {
            name = name
              .replace(/ä/gi, 'a')
              .replace(/ö/gi, 'o')
              .replace(/å/gi, 'a');
          }

          name = name.replace(/\W/g, '-').replace(/\s/g, '-').replace(/-(\d+)$/g, '_$1');

          let nodeName = content.nodeName.toLowerCase();
          if (nodeName === 'button') {
            nodeName = content.parentElement.nodeName.toLowerCase();
          }

          const anchorName = content.id
            ? content.id
            : findAvailableId(name, reserved, anchors, 0);

          anchors.push(anchorName);

          // Create table of contents if component is enabled.
          if (tableOfContentsList && nodeName === "h2") {
            let listItem = document.createElement('li');
            listItem.classList.add('table-of-contents__item');

            let link = document.createElement('a');
            link.classList.add('table-of-contents__link');
            link.href = '#' + anchorName;
            link.textContent = content.textContent.trim();

            listItem.appendChild(link);
            tableOfContentsList.appendChild(listItem);
          }
          // Create anchor links.
          content.setAttribute('id', anchorName);
        });

      // Remove loading text.
      if (tableOfContents) {
        const removeElements = tableOfContents.querySelectorAll('.js-remove');
        removeElements.forEach(function (element) {
          element.remove();
        });
      }
    },
  };
})(Drupal, once, drupalSettings);
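
To try the transliteration step without a Drupal page, the core of the loop can be pulled into a standalone function. This is a sketch under assumptions: `headingToId` and `sampleSwaps` are hypothetical names, and the table is a tiny subset chosen for one Ukrainian example, not the full table above.

```javascript
// Hypothetical subset of the swaps table, for illustration only.
const sampleSwaps = {
  'i': '[иі]',
  'p': '[п]',
  'r': '[р]',
  's': '[с]',
  't': '[т]',
  'v': '[в]',
};

// Same steps as in the behavior: lowercase, trim, swap, then normalise to an id.
function headingToId(text, swaps) {
  let name = text.toLowerCase().trim();
  Object.keys(swaps).forEach((target) => {
    name = name.replace(new RegExp(swaps[target], 'g'), target);
  });
  // Non-word characters become dashes; a trailing "-<number>" becomes "_<number>"
  // so it cannot be confused with the duplicate counter added later.
  return name.replace(/\W/g, '-').replace(/-(\d+)$/g, '_$1');
}

console.log(headingToId('Привіт світ', sampleSwaps)); // "privit-svit"
console.log(headingToId('Привіт 2', sampleSwaps)); // "privit_2"
```

Characters missing from the table still fall through to the `\W` replacement and become dashes, which is why Chinese headings are not helped by this approach.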

Arkkimaagi (Contributor) left a comment

Also, please remember to check for linter problems. I think there were some in this old piece of code, which I probably wrote at some point. Now would be a good time to fix them.

dire (Contributor, Author) commented Sep 5, 2023

> Works well, but I'd simplify the swaps a bit to reduce loops and code size.

Very good improvements, thanks! 👏 I also meant to change the array structure, but I got lost trying to figure out Chinese and totally forgot about it.

The changes are applied and I fixed some of the linting errors, but some might be left. I couldn't get the linter to work fully and it might have some wrong configs, so be a bit careful with the changes.

@dire dire requested a review from Arkkimaagi September 5, 2023 05:02

sonarqubecloud commented

Kudos, SonarCloud Quality Gate passed!

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 0 (rating A)
  • Code Smells: 0 (rating A)
  • Coverage: no coverage information
  • Duplication: 0.0%

@hyrsky hyrsky requested a review from tuutti September 11, 2023 11:11