Non-ASCII Data

When working with languages that use charter sets beyond standard ASCII charter set. There are two things to consider.

  1. When loading tables from RAW data, it is important to ensure proper encoding from originating source to UTF-8 used by Snowflake. Improper encoding loading raw source data can cause data corruption with non-ASCII charters.

  2. Applying Collation definition on the tables that are using non-ASCII data.

  3. Collation allows you to specify alternative rules for comparing strings, which can be used to compare and sort data according to a particular language or other user-specified rules.

  4. Text strings in Snowflake are stored using the UTF-8 character set, and, by default, strings are compared according to the Unicode codes that represent the characters in the string.

  5. However, comparing strings based on their UTF-8 character representations might not provide the desired/expected behavior. For example:

1. If special characters in a given language do not sort according to that language’s ordering standards, sorting might return unexpected results.
2. You might want the strings to be ordered by other rules, such as ignoring whether the characters are uppercase or lowercase.
  1. Collation allows you to explicitly specify the rules to use for comparing strings, based on:
1. Different locales (i.e., different character sets for different languages).
2. Case-sensitivity (i.e., whether to use case-sensitive or case-insensitive string comparisons without explicitly calling the UPPER or LOWER functions to convert the strings).
3. Accent sensitivity (e.g., whether Z, Ź, and Ż are considered the same letter or different letters).
4. Punctuation sensitivity (i.e., whether comparisons use only letters or include all characters).

  1. For example, if a comparison is punctuation-insensitive, then A-B-C and ABC are treated as equivalent.

5. Additional options include preferences for sorting based on the first letter in a string and trimming of leading and/or trailing blank spaces.

Collation Control options in Snowflake are granular. You can explicitly specify the collation to use at different hierarchy object levels:

  1. Account - using the account-level parameter DEFAULT_DDL_COLLATION.
  2. Database - All columns in all tables added to a database, using the ALTER DATABASE command.
  3. Schema - All columns in all tables added to a schema, using the ALTER SCHEMA command.
  4. Table - All columns added to a table using the ALTER TABLE command.
  5. Column - Individual columns in a table, using the CREATE TABLE command.

For additional details on how to use Collation, please see Snowflake documentation:

https://docs.snowflake.com/en/sql-reference/collation.html#collation-support