XMLItem - strip invalid non UTF-8 characters #502

Closed
rsoika opened this issue May 5, 2019 · 0 comments
rsoika commented May 5, 2019

A String object can contain characters that are not valid in XML 1.0, such as control characters (for example, when a user pastes mixed text content into a text field).

The JAXB marshalling process then produces invalid XML output, which cannot be processed any further.

To avoid this, we need a routine that filters out such characters.
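
To illustrate the problem, here is a minimal sketch that uses only the JDK's built-in XML parser rather than JAXB. It assumes the offending character is a control character such as U+0001; the class name `InvalidXmlCharDemo` and the helper `isWellFormed` are hypothetical and not part of the project.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class InvalidXmlCharDemo {

    /** Returns true if the given text parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the document.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+0001 is a valid Java char but not a valid XML 1.0 character.
        String pasted = "a" + (char) 0x1 + "b";
        System.out.println(isWellFormed("<item>ab</item>"));
        System.out.println(isWellFormed("<item>" + pasted + "</item>"));
    }
}
```

A marshaller that writes such a character unchanged produces a document that a conforming parser refuses to read, which is why the value has to be cleaned before it is written.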

I found a solution in Mark McLaren's Weblog: http://blog.mark-mclaren.info/2007/02/invalid-xml-characters-when-valid-utf8_5873.html

The following code example is based on Mark McLaren's (it iterates over code points instead of chars, so that characters above U+FFFF are kept):

/**
     * This method ensures that the output String has only
     * valid XML unicode characters as specified by the
     * XML 1.0 standard. For reference, please see
     * <a href="http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char">the
     * standard</a>. This method will return an empty
     * String if the input is null or empty.
     *
     * @param in The String whose non-valid characters we want to remove.
     * @return The in String, stripped of non-valid characters.
     */
    public String stripNonValidXMLCharacters(String in) {
        if (in == null || in.isEmpty()) return ""; // vacancy test.
        StringBuilder out = new StringBuilder(in.length()); // Used to hold the output.

        // Iterate by code point, not by char: a char is only 16 bits wide, so the
        // range 0x10000-0x10FFFF only ever appears as a surrogate pair.
        for (int i = 0; i < in.length(); ) {
            int current = in.codePointAt(i); // The current code point.
            i += Character.charCount(current); // Advance by 1 or 2 chars.
            if ((current == 0x9) ||
                (current == 0xA) ||
                (current == 0xD) ||
                ((current >= 0x20) && (current <= 0xD7FF)) ||
                ((current >= 0xE000) && (current <= 0xFFFD)) ||
                ((current >= 0x10000) && (current <= 0x10FFFF)))
                out.appendCodePoint(current);
        }
        return out.toString();
    }
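
The same filter can also be sketched with a regular expression. This is an alternative under stated assumptions, not the code used in the project: the class name `XmlCharFilter` and the method `strip` are hypothetical. It relies on `java.util.regex` matching by code point, so supplementary characters are kept while unpaired surrogates are removed.

```java
import java.util.regex.Pattern;

class XmlCharFilter {

    // Matches every code point outside the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /** Returns the input without invalid XML 1.0 characters, or "" for null. */
    static String strip(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        return INVALID_XML_CHARS.matcher(in).replaceAll("");
    }

    public static void main(String[] args) {
        String pasted = "Hello" + (char) 0x0 + " World" + (char) 0xB;
        System.out.println(strip(pasted));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the expression for every item value.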
rsoika added this to the 4.5.6 milestone May 5, 2019
rsoika added a commit that referenced this issue May 5, 2019
issue #502
rsoika added the testing label May 5, 2019
rsoika closed this as completed May 8, 2019
bvfalcon pushed a commit to bvfalcon/imixs-workflow that referenced this issue Jul 6, 2021