UHF-8727: Added transliterating for multiple languages. #571

dire · 2023-08-28T12:32:23Z

UHF-8727

In some languages (e.g. Arabic, Ukrainian), the IDs of headings become just dashes (------------------).

This doesn't fix Chinese, as that is a bit more complicated.
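
To see why the IDs collapse, here is a minimal sketch. It assumes the ID is built with the same regex chain that appears in the tableOfContents.js code quoted later in this thread; `naiveId` is a hypothetical name used only for illustration. Without the `u` flag, `\W` matches every character outside `[A-Za-z0-9_]`, so each Cyrillic or Arabic letter turns into a dash.

```javascript
// Hypothetical helper: applies the regex chain from tableOfContents.js
// without any transliteration step.
function naiveId(text) {
  return text
    .toLowerCase()
    .trim()
    .replace(/\W/g, '-')          // every character outside [A-Za-z0-9_] becomes a dash
    .replace(/\s/g, '-')          // no-op here: whitespace was already replaced by \W
    .replace(/-(\d+)$/g, '_$1');  // a trailing "-<digits>" becomes "_<digits>"
}

console.log(naiveId('Hello world'));  // hello-world
console.log(naiveId('Привіт світ'));  // -----------
```

The Ukrainian heading has 11 characters (including the space), so the resulting ID is 11 dashes.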

What was done

Added transliteration logic for other languages to avoid these IDs.

How to install

Make sure your etusivu instance is up and running on the latest dev branch.
- git pull origin dev
- make fresh
Update the Helfi Platform config
- composer require drupal/helfi_platform_config:dev-UHF-8727_automatic-id-tweaks
Run make drush-cr

How to test

Create Arabic and Ukrainian pages with text in those languages. Add some headings to the content.
Check that the IDs of those headings are in Latin characters and not just dashes.
Can you think of any edge cases? Please share them.
Check that the code follows our standards.
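
For the "not just dashes" check, a small helper can flag suspicious IDs. This is only a sketch: `isDashOnlyId` and `findDashOnlyIds` are hypothetical names, and the pattern assumes that a failed ID is a run of dashes, optionally followed by the numeric counter that `findAvailableId` appends to duplicates.

```javascript
// Hypothetical check: true when an ID is only dashes, optionally followed by
// a numeric counter (as appended for duplicate headings).
function isDashOnlyId(id) {
  return /^-+\d*$/.test(id);
}

// Returns the IDs that still look untransliterated.
function findDashOnlyIds(ids) {
  return ids.filter(isDashOnlyId);
}

console.log(findDashOnlyIds(['zagolovok', '------', '-------2', 'step_2']));
```

In a browser console the input could come from something like `Array.from(document.querySelectorAll('main h2[id]'), (h) => h.id)`; that selector is an assumption and may need adjusting to the page markup.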

codecov-commenter · 2023-08-28T12:43:14Z

Codecov Report

Merging #571 (05de235) into main (f950e0a) will not change coverage.
Report is 45 commits behind head on main.
The diff coverage is n/a.

❗ Current head 05de235 differs from pull request most recent head 185daf8. Consider uploading reports for the commit 185daf8 to get more accurate results

@@            Coverage Diff            @@
##               main     #571   +/-   ##
=========================================
  Coverage     12.74%   12.74%           
  Complexity      236      236           
=========================================
  Files            30       30           
  Lines           902      902           
=========================================
  Hits            115      115           
  Misses          787      787


Arkkimaagi

Works well, but I'd simplify the swaps a bit to reduce loops and code size.

Since we're lowercasing content.textContent before matching, there's no need for the uppercase matching. We can fold all matching into the lowercase entries.
We also don't need a separate regex query for each character; we can combine the characters into a single query and reduce the loops considerably. Instead of checking each character that becomes b separately with 6 regex matches, 'b': ['б', 'β', 'ب', 'ဗ', 'ბ', 'ｂ'] (and another 4 for capital B, 'B': ['Б', 'Β', 'ब', 'Ｂ']), we can check for any of them with a single regex, 'b': '[бβبဗბｂब]', which combines the lowercase matches of both b and B.
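
The difference can be sketched by counting how many regexes each shape builds for one heading. The two tables below are copied from the `b` example above; the loop used for the per-character shape is an assumption about how the original code iterated, and `swapPerCharacter` / `swapCombined` are hypothetical names.

```javascript
// Per-character shape: one regex per source character, per target letter.
const perCharacter = {
  b: ['б', 'β', 'ب', 'ဗ', 'ბ', 'ｂ'],
  B: ['Б', 'Β', 'ब', 'Ｂ'],
};

// Combined shape: one character class per target letter, lowercase only.
const combined = {
  b: '[бβبဗბｂब]',
};

// Assumed original loop: builds and runs one RegExp per source character.
function swapPerCharacter(text) {
  let out = text;
  let regexCount = 0;
  Object.keys(perCharacter).forEach((target) => {
    perCharacter[target].forEach((source) => {
      out = out.replace(new RegExp(source, 'g'), target);
      regexCount += 1;
    });
  });
  return { out, regexCount };
}

// Suggested loop: lowercase first, then one RegExp per target letter.
function swapCombined(text) {
  let out = text.toLowerCase();
  let regexCount = 0;
  Object.keys(combined).forEach((target) => {
    out = out.replace(new RegExp(combined[target], 'g'), target);
    regexCount += 1;
  });
  return { out, regexCount };
}

console.log(swapPerCharacter('бб')); // out is 'bb' after 10 regexes
console.log(swapCombined('Бб'));     // out is 'bb' after 1 regex
```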

Here's code that does what I described above; please consider using it:

'use strict';

(function (Drupal, once, drupalSettings) {
  Drupal.behaviors.table_of_contents = {
    attach: function attach() {

      function findAvailableId(name, reserved, anchors, count) {
        let newName = name;
        if (count > 0) { // Add a counter only when the heading is not unique on the page.
          newName += '-' + count;
        }
        if (reserved.includes(newName)) {
          return findAvailableId(name, reserved, anchors, ++count);
        } else if (anchors.includes(newName)) {
          if (count === 0) {
            count++; // When the reserved heading is visible on the page, let's start counting from 2 instead of 1.
          }
          return findAvailableId(name, reserved, anchors, ++count);
        }
        return newName;
      }

      const anchors = [];
      const tableOfContents = document.getElementById('helfi-toc-table-of-contents');
      const tableOfContentsList = document.querySelector('#helfi-toc-table-of-contents-list > ul');
      const mainContent = document.querySelector('main.layout-main-wrapper');
      const reservedElems = document.querySelectorAll('[id]');
      const reserved = []; // List the current IDs here to avoid creating duplicates.
      reservedElems.forEach(function (elem) {
        reserved.push(elem.id);
      });

      // Exclude elements from TOC that are not content:
      // e.g. TOC, sidebars, cookie compliance banner, etc.
      const exclusions = '' +
        ':not(.layout-sidebar-first *)' +
        ':not(.layout-sidebar-second *)' +
        ':not(.tools__container *)' +
        ':not(.breadcrumb__container *)' +
        ':not(#helfi-toc-table-of-contents *)' +
        ':not(.embedded-content-cookie-compliance *)' +
        ':not(.react-and-share-cookie-compliance *)';

      const titleComponents = [
        'h2' + exclusions,
        'h3' + exclusions,
        'h4' + exclusions,
        'h5' + exclusions,
        'h6' + exclusions,
      ];

      const mainLanguages = [
        'en',
        'fi',
        'sv',
      ];

      const swaps = {
        '0': '[°₀۰０]',
        '1': '[¹₁۱１]',
        '2': '[²₂۲２]',
        '3': '[³₃۳３]',
        '4': '[⁴₄۴٤４]',
        '5': '[⁵₅۵٥５]',
        '6': '[⁶₆۶٦６]',
        '7': '[⁷₇۷７]',
        '8': '[⁸₈۸８]',
        '9': '[⁹₉۹９]',
        'a': '[àáảãạăắằẳẵặâấầẩẫậāąåαάἀἁἂἃἄἅἆἇᾀᾁᾂᾃᾄᾅᾆᾇὰᾰᾱᾲᾳᾴᾶᾷаأအာါǻǎªაअاａä]',
        'b': '[бβبဗბｂब]',
        'c': '[çćčĉċｃ©]',
        'd': '[ďðđƌȡɖɗᵭᶁᶑдδدضဍဒდｄᴅᴆ]',
        'e': '[éèẻẽẹêếềểễệëēęěĕėεέἐἑἒἓἔἕὲеёэєəဧေဲეएإئｅ]',
        'f': '[фφفƒფｆ]',
        'g': '[ĝğġģгґγဂგگｇ]',
        'h': '[ĥħηήحهဟှჰｈ]',
        'i': '[íìỉĩịîïīĭįıιίϊΐἰἱἲἳἴἵἶἷὶῐῑῒῖῗіїиဣိီည်ǐიइیｉi̇ϒ]',
        'j': '[ĵјჯجｊ]',
        'k': '[ķĸкκقكကკქکｋ]',
        'l': '[łľĺļŀлλلလლｌल]',
        'm': '[мμمမმｍ]',
        'n': '[ñńňņŉŋνнنနნｎ]',
        'o': '[óòỏõọôốồổỗộơớờởỡợøōőŏοὀὁὂὃὄὅὸόоوθိုǒǿºოओｏöө]',
        'p': '[пπပპپｐ]',
        'q': '[ყｑ]',
        'r': '[ŕřŗрρرრｒ]',
        's': '[śšşсσșςسصစſსｓŝ]',
        't': '[ťţтτțتطဋတŧთტｔ]',
        'u': '[úùủũụưứừửữựûūůűŭųµуဉုူǔǖǘǚǜუउｕўü]',
        'v': '[вვϐｖ]',
        'w': '[ŵωώဝွｗ]',
        'x': '[χξｘ]',
        'y': '[ýỳỷỹỵÿŷйыυϋύΰيယｙῠῡὺ]',
        'z': '[źžżзζزဇზｚ]',
        'aa': '[عआآ]',
        'ae': '[æǽ]',
        'ai': '[ऐ]',
        'ch': '[чჩჭچ]',
        'dj': '[ђđ]',
        'dz': '[џძ]',
        'ei': '[ऍ]',
        'gh': '[غღ]',
        'ii': '[ई]',
        'ij': '[ĳ]',
        'kh': '[хخხ]',
        'lj': '[љ]',
        'nj': '[њ]',
        'oe': '[öœؤ]',
        'oi': '[ऑ]',
        'oii': '[ऒ]',
        'ps': '[ψ]',
        'sh': '[шშش]',
        'shch': '[щ]',
        'ss': '[ß]',
        'sx': '[ŝ]',
        'th': '[þϑثذظ]',
        'ts': '[цცწ]',
        'ue': '[ü]',
        'uu': '[ऊ]',
        'ya': '[я]',
        'yu': '[ю]',
        'zh': '[жჟژ]',
        'gx': '[ĝ]',
        'hx': '[ĥ]',
        'jx': '[ĵ]',
      };

      // Craft table of contents.
      once('table-of-contents', titleComponents.join(','), mainContent)
        .forEach(function (content) {
          let name = content.textContent
            .toLowerCase()
            .trim();

          // To ensure backwards compatibility, this is done only to "other" languages.
          if (!mainLanguages.includes(drupalSettings.path.currentLanguage)) {
            Object.keys(swaps).forEach((swap) => {
              name = name.replace(new RegExp(swaps[swap], 'g'), swap);
            });
          }
          else {
            name = name
              .replace(/ä/gi, 'a')
              .replace(/ö/gi, 'o')
              .replace(/å/gi, 'a');
          }

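          // Replace every character outside [A-Za-z0-9_] (whitespace included)
          // with a dash, then turn a trailing "-<digits>" into "_<digits>".
          name = name.replace(/\W/g, '-').replace(/\s/g, '-').replace(/-(\d+)$/g, '_$1');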

          let nodeName = content.nodeName.toLowerCase();
          if (nodeName === 'button') {
            nodeName = content.parentElement.nodeName.toLowerCase();
          }

          const anchorName = content.id
            ? content.id
            : findAvailableId(name, reserved, anchors, 0);

          anchors.push(anchorName);

          // Create table of contents if component is enabled.
          if (tableOfContentsList && nodeName === 'h2') {
            const listItem = document.createElement('li');
            listItem.classList.add('table-of-contents__item');

            const link = document.createElement('a');
            link.classList.add('table-of-contents__link');
            link.href = '#' + anchorName;
            link.textContent = content.textContent.trim();

            listItem.appendChild(link);
            tableOfContentsList.appendChild(listItem);
          }
          // Create anchor links.
          content.setAttribute('id', anchorName);
        });

      // Remove loading text.
      if (tableOfContents) {
        const removeElements = tableOfContents.querySelectorAll('.js-remove');
        removeElements.forEach(function (element) {
          element.remove();
        });
      }
    },
  };
})(Drupal, once, drupalSettings);
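
One edge case worth noting in the `swaps` table above: some characters appear under two keys, for example `ö` under both `o` and `oe`, and `ü` under both `u` and `ue`. `Object.keys` returns string keys in insertion order (after integer-like keys), so the single-letter entry runs first and the later entry never sees that character. A reduced sketch, with a hypothetical `transliterate` helper and a two-entry table standing in for the full one:

```javascript
// Reduced stand-in for the swaps table; only the overlapping entries matter here.
const swaps = {
  o: '[öо]',   // 'ö' and Cyrillic 'о'
  oe: '[öœ]',  // 'ö' again, plus 'œ'
};

// Same loop as in the proposed code: keys are applied in Object.keys order.
function transliterate(text) {
  let name = text.toLowerCase().trim();
  Object.keys(swaps).forEach((swap) => {
    name = name.replace(new RegExp(swaps[swap], 'g'), swap);
  });
  return name;
}

console.log(transliterate('Köln'));   // koln, not koeln
console.log(transliterate('Œuvre'));  // oeuvre
```

Whether `ö` should map to `o` or `oe` is not stated in this thread; if the two-letter form is wanted, the character would have to be removed from the single-letter class.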

Arkkimaagi

Also, please remember to check for linter problems. I think there were some left over from this old piece of code, which I probably wrote at some point. Now would be a good time to fix them.

dire · 2023-09-05T05:02:03Z

Very good improvements, thanks! 👏 I meant to also do the array structure change, but I got lost trying to figure out Chinese and totally forgot about it.

The changes are applied and I fixed some of the linting errors, but some might be left. I couldn't get the linter fully working, and it might have some wrong configs, so be a bit careful with the changes.

modules/helfi_toc/assets/js/tableOfContents.js

sonarqubecloud · 2023-09-11T11:00:45Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
0.0% Duplication

UHF-8727: Added transliterating for multiple languages.

d123c34

Arkkimaagi suggested changes Sep 1, 2023

dire added 2 commits September 1, 2023 14:16

UHF-8727: Improved the code.

21e4658

UHF-8727: Fixed some linter errors/warnings.

185daf8

dire requested a review from Arkkimaagi September 5, 2023 05:02

tuutti requested changes Sep 11, 2023

modules/helfi_toc/assets/js/tableOfContents.js (outdated, resolved)

hyrsky added 2 commits September 11, 2023 13:52

UHF-8727: add comment to regex

943498e

UHF-8727: fix indentation

a0bae48

hyrsky requested a review from tuutti September 11, 2023 11:11

hyrsky approved these changes Sep 12, 2023

tuutti approved these changes Sep 12, 2023

Arkkimaagi approved these changes Sep 12, 2023

dire merged commit d930404 into main Sep 12, 2023

dire deleted the UHF-8727_automatic-id-tweaks branch September 12, 2023 06:53
