
MySQL: which UTF-8 should I choose for a new database?


fenmar

Question

5 answers to this question


There are small differences between utf8_general_ci and utf8_unicode_ci, but they are not significant. It all comes down to general being slightly faster, especially when the text uses diacritics (the Polish "ogonki") :) More information below.

Source:

There are at least two important differences:

  • Accuracy of sorting

    utf8_unicode_ci is based on the Unicode standard for sorting, and sorts accurately in a wide range of languages.

    utf8_general_ci comes very close to correct Unicode sorting in many languages, but has a number of inaccuracies in some languages.

  • Performance

    utf8_general_ci is faster at comparisons and sorting, because it takes a bunch of performance-related shortcuts.

    utf8_unicode_ci uses a much more complex comparison algorithm which aims for correct sorting in a wide range of languages, but this makes it slower to sort and compare large numbers of fields.

Unicode defines complex sets of rules for how characters should be sorted. These rules need to take local conventions into account; not everybody sorts their characters in what we would call 'alphabetical order'. As far as Latin (i.e. "European") languages go, there is not much difference between Unicode sorting and the simplified utf8_general_ci sorting in MySQL, but there are still a few differences:

  • For example, Unicode collation sorts "ß" like "ss", and "Œ" like "OE", whereas utf8_general_ci sorts them as single characters like "s" and presumably "e" respectively.
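The "ß" case can be sketched outside the database. This is a toy model, not MySQL's real weight tables: `GENERAL_MAP`, `UNICODE_MAP`, and both key functions are hypothetical names made up for illustration. It only shows why a one-character-to-one-weight mapping and an expanding mapping give different orderings and different equality results.

```python
# Toy model only: these tables are illustrative assumptions,
# not the weight tables MySQL actually uses.

# "general"-style: every character maps to exactly one character.
GENERAL_MAP = {"ß": "s"}

# "unicode"-style: a character may expand to several characters.
UNICODE_MAP = {"ß": "ss", "œ": "oe"}


def general_key(text: str) -> str:
    """Case-insensitive key with strict one-to-one replacement."""
    return "".join(GENERAL_MAP.get(ch, ch) for ch in text.lower())


def unicode_key(text: str) -> str:
    """Case-insensitive key that allows expansions such as ß -> ss."""
    return "".join(UNICODE_MAP.get(ch, ch) for ch in text.lower())


words = ["Masse", "Maße", "Mast"]

# With one-to-one mapping, "Maße" becomes "mase" and sorts before "Masse".
general_sorted = sorted(words, key=general_key)

# With expansion, "Maße" becomes "masse" and ties with "Masse".
unicode_sorted = sorted(words, key=unicode_key)

print(general_sorted)
print(unicode_sorted)
```

In this sketch the two words only compare equal under the expanding key, which mirrors the behaviour described above for "ß" and "ss".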

In non-Latin languages, such as Asian languages or languages with different alphabets, there may be no differences between utf8_general_ci and utf8_unicode_ci, or there may be many more. The suitability of utf8_general_ci depends heavily on the language used.

Some Unicode characters are defined as ignorable, which means they shouldn't count toward the sort order and you should move on to the next character instead. utf8_unicode_ci handles these properly, whereas for performance reasons utf8_general_ci doesn't and a word with the ignorable character will be sorted differently to a word without.
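The effect of ignorable characters can be sketched the same way. Again this is a toy model under stated assumptions: the `IGNORABLE` set (soft hyphen and zero-width non-joiner) and the two key functions are hypothetical and chosen only for illustration; real collations work with full weight tables.

```python
# Toy model only. IGNORABLE is an illustrative assumption:
# soft hyphen (U+00AD) and zero-width non-joiner (U+200C).
IGNORABLE = {"\u00ad", "\u200c"}


def key_skipping_ignorables(text: str) -> str:
    """Drop ignorable characters before comparing."""
    return "".join(ch for ch in text.lower() if ch not in IGNORABLE)


def key_keeping_everything(text: str) -> str:
    """Compare every character, including the ignorable ones."""
    return text.lower()


plain = "cooperate"
with_soft_hyphen = "co\u00adoperate"  # looks identical when rendered

words = [with_soft_hyphen, "cop", plain]

# Skipping ignorables keeps the two spellings next to each other.
sorted_skipping = sorted(words, key=key_skipping_ignorables)

# Keeping them pushes the soft-hyphen spelling away from its twin.
sorted_keeping = sorted(words, key=key_keeping_everything)

print([w.replace("\u00ad", "<SHY>") for w in sorted_skipping])
print([w.replace("\u00ad", "<SHY>") for w in sorted_keeping])
```

When the ignorable character is skipped, both spellings get the same key; when it is kept, "cop" ends up between them.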

What should you use?

There is almost never any reason to use utf8_general_ci anymore, as we have left behind the point where CPU speed is low enough that the performance difference would be important. Your database will almost certainly be limited by quite other bottlenecks than this nowadays. The difference in performance is only going to be measurable in extremely specialised situations, and if that's you, you'd already know about it. If you're experiencing slow sorting, in almost all cases it'll be an issue with your indexes/query plan.

When I originally wrote this answer (over 3 years ago) I said that if you wanted, you could use utf8_general_ci most of the time, and only use utf8_unicode_ci when sorting was going to be important enough to justify the performance cost. However, the performance cost is no longer really relevant (and it may not have been back then, either). It's more important to sort properly in whichever language your users are using.
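For the original question (a new database), the recommendation could be applied with a statement like the one built below. This is a minimal sketch: the helper `create_database_sql` and the database name `shop` are hypothetical, and the character set is simply taken from the prefix of the collation name.

```python
import re


def create_database_sql(name: str, collation: str = "utf8_unicode_ci") -> str:
    """Build a CREATE DATABASE statement with an explicit collation.

    Hypothetical helper for illustration; it only formats a string
    and does not talk to a MySQL server.
    """
    # Accept only simple identifiers so the name can be quoted safely.
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        raise ValueError(f"unsupported database name: {name!r}")
    if not re.fullmatch(r"[a-z0-9]+_[a-z0-9_]+", collation):
        raise ValueError(f"unsupported collation name: {collation!r}")

    # Collation names start with their character set, e.g. utf8_unicode_ci.
    charset = collation.split("_", 1)[0]
    return (
        f"CREATE DATABASE `{name}` "
        f"CHARACTER SET {charset} COLLATE {collation};"
    )


statement = create_database_sql("shop")  # "shop" is a made-up name
print(statement)
```

The printed statement can be run in any MySQL client; passing `"utf8_general_ci"` as the second argument produces the faster but less accurate variant discussed above.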
