Wikimedia Internationalization Library
======================================
This library provides interfaces and value objects for internationalization (i18n)
of applications in PHP.
It is based on the i18n code used in MediaWiki, and is also intended to be
compatible with [jQuery.i18n], a JavaScript i18n library.
Concepts
--------
Any text string that is needed in an application is a **message**. This might
be something like a button label, a sentence, or a longer text. Each message is
assigned a **message key**, which is used as the identifier in code.
Each message is translated into various languages, each represented by a
**language code**. The message's text (as translated into each language) can
contain **placeholders**, each of which marks a place in the message where a
**parameter** is to be inserted, and **formatting commands**. Apart from these
placeholders and formatting commands, the text might be plain text, or it
might be in a **markup language** such as wikitext or Markdown.
A **formatter** is used to convert the message key and parameters into a text
representation in a particular language and **output format**.
The library itself imposes few restrictions on all of these concepts; this
document contains recommendations to help various implementations operate in
compatible ways.
Usage
-----
<pre lang="php">
use Wikimedia\Message\MessageValue;
use Wikimedia\Message\ParamType;
use Wikimedia\Message\ScalarParam;

// Constructor interface
$message = new MessageValue( 'message-key', [
    'parameter',
    new MessageValue( 'another-message' ),
    // MessageParam is abstract, so a single value is wrapped in a ScalarParam
    new ScalarParam( ParamType::NUM, 12345 ),
] );

// Fluent interface
$message = ( new MessageValue( 'message-key' ) )
    ->params( 'parameter', new MessageValue( 'another-message' ) )
    ->numParams( 12345 );

// Formatting
$messageFormatter = $serviceContainer->get( 'MessageFormatterFactory' )->getTextFormatter( 'de' );
$output = $messageFormatter->format( $message );
</pre>
Class Overview
--------------
### Messages
Messages and their parameters are represented by newable value objects.
**MessageValue** represents an instance of a message, holding the key and any
parameters. It is mutable in that parameters can be added to the object after
creation.
**MessageParam** is an abstract value class representing a parameter to a message.
It has a type (using constants defined in the **ParamType** class) and a value. It
has two implementations:
- **ScalarParam** represents a single-valued parameter, such as a text string, a
number, or another message.
- **ListParam** represents a list of values, which will be joined together with
appropriate separators. It has a "list type" (using constants defined in the
**ListType** class) defining the desired separators.
#### Machine-readable messages
**DataMessageValue** represents a message with additional machine-readable
data. In addition to the key and message parameters, it holds a "code" and
structured data that would be a useful representation of the message in an API
response or the like.
For example, a message for an "integer out of range" error might have one of
three different keys depending on whether the range has a minimum, maximum, or
both. But all should have the same code (representing the concept of "integer
out of range") and should likely have structured data representing the range
directly as `[ 'min' => 1, 'max' => 10 ]` rather than as a flat array of
MessageParam objects.
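
The "integer out of range" example can be sketched with plain arrays standing in for a DataMessageValue. This is only an illustration: the function name `integerOutOfRangeSketch`, the three message keys, and the code `outofrange` are hypothetical and not part of this library.

```php
/**
 * Build a plain-array stand-in for a DataMessageValue describing an
 * "integer out of range" error. The keys, the code, and the function
 * name are hypothetical; only the shape (key, params, code, data)
 * follows the description above.
 */
function integerOutOfRangeSketch( ?int $min, ?int $max ): array {
    if ( $min === null && $max === null ) {
        throw new InvalidArgumentException( 'At least one bound is required' );
    }
    // The key depends on which bounds exist...
    if ( $min !== null && $max !== null ) {
        $key = 'example-outofrange-minmax';
        $params = [ $min, $max ];
    } elseif ( $min !== null ) {
        $key = 'example-outofrange-min';
        $params = [ $min ];
    } else {
        $key = 'example-outofrange-max';
        $params = [ $max ];
    }
    // ...while the code stays the same and the data describes the range directly.
    $data = array_filter(
        [ 'min' => $min, 'max' => $max ],
        static function ( $bound ) {
            return $bound !== null;
        }
    );
    return [
        'key' => $key,
        'params' => $params,
        'code' => 'outofrange',
        'data' => $data,
    ];
}

$error = integerOutOfRangeSketch( 1, 10 );
echo json_encode( $error['data'] ), "\n"; // prints {"min":1,"max":10}
```

An API client can then switch on the code and read the bounds from the data without parsing the translated text.
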
### Formatters
A formatter for a particular language is obtained from an implementation of
**IMessageFormatterFactory**. No implementation of this interface is provided by
this library. If an environment needs its formatters to vary behavior on things
other than the language code, for example selecting among multiple sources of
messages or markup language used for processing message texts, it should define
a MessageFormatterFactoryFactory of some sort to provide appropriate
IMessageFormatterFactory implementations.
There is no one base interface for all formatters; the intent is that type
hinting will ensure that the formatter being used will produce output in the
expected output format. The defined output formats are:
- **ITextFormatter** produces plain text output.
No implementations of these interfaces are provided by this library.
Formatter implementations are expected to perform the following procedure to
generate the output string:
1. Fetch the message's translation in the formatter's language. Details of this
   fetching are unspecified here.
   - If no translation is found in the formatter's language, it should attempt
     to fall back to appropriate other languages. Details of the fallback are
     unspecified here.
   - If no translation can be found in any fallback language, a string should
     be returned that indicates at minimum the message key that was unable to
     be found.
2. Replace placeholders with parameter values.
   - Note that placeholders must not be replaced recursively. That is, if a
     parameter's value contains text that looks like a placeholder, it must not
     be replaced as if it really were a placeholder.
   - Certain types of parameters are not substituted directly at this stage.
     Instead their placeholders must be replaced with an opaque representation
     that will not be misinterpreted during later stages.
     - Parameters of type RAW or PLAINTEXT
     - TEXT parameters with a MessageValue as the value
     - LIST parameters with any late-substituted value as one of their values.
3. Process any formatting commands.
4. Process the source markup language to produce a string in the desired output
   format. This may be a no-op, and may be combined with the previous step if
   the markup language implements compatible formatting commands.
5. Replace any opaque representations from step 2 with the actual values of
   the corresponding parameters.
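
The procedure above can be sketched as a single function. This is a minimal illustration under several assumptions, not an implementation of ITextFormatter: translations and parameters are plain arrays, at most nine parameters are supported, steps 3 and 4 are no-ops, and the function name, the marker format, and the "missing message" text are all hypothetical.

```php
/**
 * Hypothetical plain-text formatting sketch.
 *
 * @param array $translations Language code => [ message key => text ]
 * @param string[] $fallbacks Language codes to try, in order
 * @param string $key Message key
 * @param array $params List of [ 'type' => 'text'|'raw', 'value' => string ]
 */
function formatMessageSketch(
    array $translations, array $fallbacks, string $key, array $params
): string {
    // Step 1: fetch the translation, walking the fallback chain.
    $text = null;
    foreach ( $fallbacks as $lang ) {
        if ( isset( $translations[$lang][$key] ) ) {
            $text = $translations[$lang][$key];
            break;
        }
    }
    if ( $text === null ) {
        // Nothing found: return a string that names the missing key.
        return '[missing message: ' . $key . ']';
    }

    // Step 2: replace placeholders in one pass. Replacement text is never
    // rescanned, so a value containing "$1" is not expanded again.
    $late = [];
    $text = preg_replace_callback(
        '/\$([1-9])/',
        static function ( array $m ) use ( $params, &$late ): string {
            $index = (int)$m[1] - 1;
            if ( !isset( $params[$index] ) ) {
                return $m[0]; // No such parameter: leave unchanged.
            }
            if ( $params[$index]['type'] === 'raw' ) {
                // Late-substituted parameter: emit an opaque marker instead.
                $marker = "\x00{$index}\x00";
                $late[$marker] = $params[$index]['value'];
                return $marker;
            }
            return $params[$index]['value'];
        },
        $text
    );

    // Steps 3 and 4 (formatting commands, markup processing) are no-ops here.

    // Step 5: swap the opaque markers for the real values.
    return strtr( $text, $late );
}

$translations = [ 'en' => [ 'example-greet' => 'Hello, $1! $2' ] ];
$params = [
    [ 'type' => 'text', 'value' => '$2' ],
    [ 'type' => 'raw', 'value' => '<b>$1</b>' ],
];
echo formatMessageSketch( $translations, [ 'de', 'en' ], 'example-greet', $params ), "\n";
// prints: Hello, $2! <b>$1</b>
```

Neither the `$2` inside the first value nor the `$1` inside the raw value is treated as a placeholder, which is the non-recursive behavior step 2 requires.
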
Guidelines for Interoperability
-------------------------------
Following these guidelines allows libraries to safely supply their own
translations to every app using them, and allows apps to use those translations
instead of retranslating everything. It also helps open source projects use
[translatewiki.net] for crowdsourced volunteer translation into many languages.
### Language codes
[BCP 47] language tags should be used for language codes. If a supplied
language tag is not recognized, at minimum the corresponding tag with all
optional subtags stripped should be tried as a fallback.
All messages must have a translation in English (code "en"). All languages
should fall back to English as a last resort.
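
A minimal sketch of the two fallback rules just stated follows. The function name is hypothetical, and real environments will usually insert language-specific fallbacks between the stripped tag and English. Tags are lowercased because BCP 47 tags are case-insensitive.

```php
/**
 * Hypothetical helper: list the language codes to try, in order.
 * Only two rules are implemented: strip all optional subtags,
 * then fall back to English as a last resort.
 *
 * @return string[]
 */
function fallbackChainSketch( string $tag ): array {
    $tag = strtolower( $tag );
    $chain = [ $tag ];
    // Subtags are separated by hyphens; the first one is the primary language.
    $primary = explode( '-', $tag )[0];
    if ( $primary !== $tag ) {
        $chain[] = $primary;
    }
    if ( !in_array( 'en', $chain, true ) ) {
        $chain[] = 'en';
    }
    return $chain;
}

echo implode( ' > ', fallbackChainSketch( 'de-AT' ) ), "\n"; // prints: de-at > de > en
```
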
The English translations should use `{{PLURAL:...}}` and `{{GENDER:...}}` even
when English doesn't make a grammatical distinction, to signal to translators
that plural/gender support is available.
Language code "qqq" is reserved for documenting messages. Documentation should
describe the context in which the message is used and the values of all
parameters used with the message. Generally this is written in English.
Attempting to obtain a message formatter for "qqq" should return one for "en"
instead.
Language code "qqx" is reserved for debugging. Rather than retrieving
translations from some underlying storage, every key should act as if it were
translated as something like `(key-name: $1, $2, $3)`, with the number of
placeholders depending on how many parameters are included in the
MessageValue.
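
As an illustration, the "qqx" text for a key could be generated as below. The function name is hypothetical, and rendering a message without parameters as `(key-name)` is an assumption, since the text above only shows the form with parameters.

```php
/**
 * Hypothetical helper producing the "qqx" debugging text for a key.
 */
function qqxTextSketch( string $key, int $paramCount ): string {
    if ( $paramCount <= 0 ) {
        // Assumption: no parameters means no colon and no placeholder list.
        return '(' . $key . ')';
    }
    $placeholders = [];
    for ( $i = 1; $i <= $paramCount; $i++ ) {
        $placeholders[] = '$' . $i;
    }
    return '(' . $key . ': ' . implode( ', ', $placeholders ) . ')';
}

echo qqxTextSketch( 'applepicker-title', 3 ), "\n";
// prints: (applepicker-title: $1, $2, $3)
```
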
### Message keys
Message keys intended for use with external implementations should follow
certain guidelines for interoperability:
- Keys should be restricted to the regular expression `/^[a-z][a-z0-9-]*$/`.
That is, a key should consist of lowercase ASCII letters, digits, and hyphens
only, and should begin with a letter.
- Keys should be prefixed to help avoid collisions. For example, a library
named "ApplePicker" should prefix its message keys with "applepicker-".
- Common values needing translation, such as names of months and weekdays,
should not be prefixed by each library. Libraries needing these should use
keys from the [Common Locale Data Repository][CLDR] and document this
requirement, and environments should provide these messages.
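
The first two guidelines are easy to check mechanically, for example in a test suite. The sketch below uses a hypothetical function name. It adds the `D` modifier to the regular expression so that `$` cannot match before a trailing newline.

```php
/**
 * Hypothetical check: does $key follow the character and prefix guidelines?
 */
function isInteroperableKeySketch( string $key, string $prefix ): bool {
    return preg_match( '/^[a-z][a-z0-9-]*$/D', $key ) === 1
        && strncmp( $key, $prefix, strlen( $prefix ) ) === 0;
}

var_dump( isInteroperableKeySketch( 'applepicker-title', 'applepicker-' ) ); // bool(true)
var_dump( isInteroperableKeySketch( 'ApplePicker_title', 'applepicker-' ) ); // bool(false)
```
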
### Message format
Placeholders are represented by `$1`, `$2`, `$3`, and so on. Text like `$100`
is interpreted as a placeholder for parameter 100 if 100 or more parameters
were supplied, as a placeholder for parameter 10 followed by text "0" if
between ten and 99 parameters were supplied, and as a placeholder for parameter
1 followed by text "00" if between one and nine parameters were supplied.
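
The digit-matching rule can be sketched as follows: for each run of digits after `$`, take the longest prefix that names a supplied parameter and keep the remaining digits as literal text. The function name is hypothetical, and parameters are assumed to be plain strings.

```php
/**
 * Hypothetical helper: substitute $1, $2, ... using the longest run of
 * digits that still refers to a supplied parameter.
 *
 * @param string[] $params Parameter values; $params[0] is parameter 1
 */
function replacePlaceholdersSketch( string $text, array $params ): string {
    $count = count( $params );
    return preg_replace_callback(
        '/\$([1-9][0-9]*)/',
        static function ( array $m ) use ( $params, $count ): string {
            $digits = $m[1];
            // Try the whole run of digits first, then ever shorter prefixes.
            for ( $len = strlen( $digits ); $len >= 1; $len-- ) {
                $index = (int)substr( $digits, 0, $len );
                if ( $index <= $count ) {
                    // Leftover digits become literal text after the value.
                    return $params[$index - 1] . substr( $digits, $len );
                }
            }
            return $m[0]; // No matching parameter: leave unchanged.
        },
        $text
    );
}

echo replacePlaceholdersSketch( 'Total: $100', [ 'a', 'b', 'c' ] ), "\n";
// prints: Total: a00
```
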
All formatting commands look like `{{NAME:$value1|$value2|$value3|...}}`. Braces
are to be balanced, e.g. `{{NAME:foo|{{bar|baz}}}}` has $value1 as "foo" and
$value2 as "{{bar|baz}}". The name is always case-insensitive.
Anything syntactically resembling a placeholder or formatting command that does
not correspond to an actual parameter or known command should be left unchanged
for processing by the markup language processor.
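
To illustrate the balanced-brace rule, the sketch below splits one complete formatting command into its name and values. The function name and return shape are hypothetical, and locating commands inside a longer message is left out.

```php
/**
 * Hypothetical helper: split "{{NAME:a|b|...}}" into name and values,
 * keeping nested "{{...}}" groups intact.
 *
 * @return array|null [ 'name' => string, 'args' => string[] ], or null if
 *  the input does not look like a formatting command
 */
function splitCommandSketch( string $command ): ?array {
    if ( !preg_match( '/^\{\{([^:{}|]+):(.*)\}\}$/s', $command, $m ) ) {
        return null;
    }
    $body = $m[2];
    $length = strlen( $body );
    $args = [];
    $current = '';
    $depth = 0;
    for ( $i = 0; $i < $length; $i++ ) {
        $pair = substr( $body, $i, 2 );
        if ( $pair === '{{' ) {
            // Entering a nested group: pipes inside it do not split values.
            $depth++;
            $current .= $pair;
            $i++;
        } elseif ( $pair === '}}' && $depth > 0 ) {
            $depth--;
            $current .= $pair;
            $i++;
        } elseif ( $body[$i] === '|' && $depth === 0 ) {
            $args[] = $current;
            $current = '';
        } else {
            $current .= $body[$i];
        }
    }
    $args[] = $current;
    // Names are case-insensitive, so normalize to upper case.
    return [ 'name' => strtoupper( $m[1] ), 'args' => $args ];
}

print_r( splitCommandSketch( '{{NAME:foo|{{bar|baz}}}}' )['args'] );
// the two values are "foo" and "{{bar|baz}}"
```
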
Libraries providing messages for use by externally-defined formatters should
generally assume no markup language will be applied, and should avoid
constructs used by common markup languages unless they also make sense when
read as plain text.
### Formatting commands
The following formatting commands should be supported.
#### PLURAL
`{{PLURAL:$count|$formA|$formB|...}}` is used to produce plurals.
$count is a number, which may have been formatted with ParamType::NUM.
The number of forms and which count corresponds to which form depend on the
language; for example, English uses `{{PLURAL:$1|one|other}}` while Arabic uses
`{{PLURAL:$1|zero|one|two|few|many|other}}`. Details are defined in
[CLDR][CLDR plurals].
It is not possible to "skip" positions while still supplying later ones. If too
few values are supplied, the final form is repeated for subsequent positions.
If there is an explicit plural form to be given for a specific number, it may
be specified with syntax like `{{PLURAL:$1|one egg|$1 eggs|12=a dozen eggs}}`.
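
A minimal sketch of form selection for English only follows. It is an illustration rather than a general implementation: the function name is hypothetical, the count is assumed to be an integer, and placeholders are assumed to have been substituted already. Other languages need their own CLDR rules.

```php
/**
 * Hypothetical helper: pick the PLURAL form for an integer count in English.
 *
 * @param int $count The number being described
 * @param string[] $forms Forms as written after the count, e.g.
 *  [ 'one egg', '5 eggs', '12=a dozen eggs' ]
 */
function pluralEnglishSketch( int $count, array $forms ): string {
    $positional = [];
    foreach ( $forms as $form ) {
        // Explicit forms look like "12=a dozen eggs" and win when they match.
        if ( preg_match( '/^(\d+)=(.*)$/s', $form, $m ) ) {
            if ( (int)$m[1] === $count ) {
                return $m[2];
            }
            continue;
        }
        $positional[] = $form;
    }
    if ( $positional === [] ) {
        return '';
    }
    // English has two forms: "one" for exactly 1, "other" for everything else.
    $index = ( $count === 1 ) ? 0 : 1;
    // If too few forms were supplied, the final form is repeated.
    $index = min( $index, count( $positional ) - 1 );
    return $positional[$index];
}

echo pluralEnglishSketch( 12, [ 'one egg', '12 eggs', '12=a dozen eggs' ] ), "\n";
// prints: a dozen eggs
```
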
#### GENDER
`{{GENDER:$name|$masculine|$feminine|$unspecified}}` is used to handle
grammatical gender, typically when messages refer to user accounts.
This supports three grammatical genders: "male", "female", and a third option
for cases where the gender is unspecified, unknown, or neither male nor female.
It does not attempt to handle animate-inanimate or [T-V] distinctions.
$name is a user account name or other similar identifier. If the name given
does not correspond to any known user account, the $unspecified form should
generally be used.
If $feminine and/or $unspecified is not specified, the value of $masculine
is normally used in its place.
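
The selection and fallback rules can be sketched as below. The function name is hypothetical, and the gender is assumed to have been looked up already, since resolving $name to a gender depends on the environment.

```php
/**
 * Hypothetical helper: pick the GENDER form.
 *
 * @param string $gender 'male', 'female', or anything else for unspecified
 * @param string[] $forms [ $masculine, $feminine, $unspecified ]; later
 *  entries may be omitted
 */
function genderFormSketch( string $gender, array $forms ): string {
    if ( $forms === [] ) {
        return '';
    }
    $masculine = $forms[0];
    // Missing forms fall back to the masculine form.
    $feminine = $forms[1] ?? $masculine;
    $unspecified = $forms[2] ?? $masculine;
    switch ( $gender ) {
        case 'male':
            return $masculine;
        case 'female':
            return $feminine;
        default:
            return $unspecified;
    }
}

echo genderFormSketch( 'unknown', [ 'his', 'her', 'their' ] ), "\n"; // prints: their
```
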
#### GRAMMAR
`{{GRAMMAR:$form|$term}}` converts a term to an appropriate grammatical form.
If no mapping for $term to $form exists, $term should be returned unchanged.
See [jQuery.i18n § Grammar][jQuery.i18n grammar] for details.
#### BIDI
`{{BIDI:$text}}` applies directional isolation to the wrapped text, to attempt
to avoid errors where directionally-neutral characters are wrongly displayed
when between LTR and RTL content.
This should output U+202A (left-to-right embedding) or U+202B (right-to-left
embedding) before the text, depending on the directionality of the first
strongly-directional character in $text, and U+202C (pop directional
formatting) after, or do something equivalent for the target output format.
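
For plain-text output this could look like the sketch below, which makes deliberate simplifications. The function name is hypothetical. "Strongly-directional" is approximated as "the first letter", and only the Hebrew, Arabic, Syriac, and Thaana scripts are treated as right-to-left. Returning text with no letters unchanged is also an assumption. A real implementation should consult the Unicode bidirectional character classes.

```php
/**
 * Hypothetical helper: wrap $text in directional embedding characters.
 * Simplification: direction is taken from the first letter's script.
 */
function bidiEmbedSketch( string $text ): string {
    $lre = "\u{202A}"; // left-to-right embedding
    $rle = "\u{202B}"; // right-to-left embedding
    $pdf = "\u{202C}"; // pop directional formatting

    // Find the first letter; digits, spaces, and punctuation are skipped.
    if ( !preg_match( '/\p{L}/u', $text, $m ) ) {
        return $text; // Assumption: nothing directional, nothing to wrap.
    }
    $isRtl = preg_match(
        '/^[\p{Hebrew}\p{Arabic}\p{Syriac}\p{Thaana}]$/u',
        $m[0]
    ) === 1;
    return ( $isRtl ? $rle : $lre ) . $text . $pdf;
}

// Prints the JSON-escaped result so the control characters are visible.
echo json_encode( bidiEmbedSketch( 'abc' ) ), "\n"; // prints "\u202aabc\u202c"
```
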
### Supplying translations
Code intending its messages to be used by externally-defined formatters should
supply the translations as described by
[jQuery.i18n § Message File Format][jQuery.i18n file format].
In brief, the base directory of the library should contain a directory named
"i18n". This directory should contain JSON files named by code such as
"en.json", "de.json", "qqq.json", each with contents like:
```json
{
    "@metadata": {
        "authors": [
            "Alice",
            "Bob",
            "Carol",
            "David"
        ],
        "last-updated": "2012-09-21"
    },
    "appname-title": "Example Application",
    "appname-sub-title": "An example application",
    "appname-header-introduction": "Introduction",
    "appname-about": "About this application",
    "appname-footer": "Footer text"
}
```
Formatter implementations should be able to consume message data supplied in
this format, either directly via registration of i18n directories to check or
by providing tooling to incorporate it during a build step.
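
As a rough sketch of the consuming side, the function below turns the contents of one such file into a key-to-text map. The function name is hypothetical. It takes the JSON text rather than a path to stay self-contained, so a real loader would first read, for example, `i18n/en.json`.

```php
/**
 * Hypothetical helper: decode one message file and drop its metadata.
 *
 * @param string $json Contents of a file such as "i18n/en.json"
 * @return string[] Message key => message text
 */
function loadMessagesSketch( string $json ): array {
    $data = json_decode( $json, true );
    if ( !is_array( $data ) ) {
        throw new InvalidArgumentException(
            'Invalid message file: ' . json_last_error_msg()
        );
    }
    // "@metadata" describes the file itself and is not a message.
    unset( $data['@metadata'] );
    return $data;
}

$messages = loadMessagesSketch(
    '{"@metadata":{"authors":["Alice"]},"appname-title":"Example Application"}'
);
echo $messages['appname-title'], "\n"; // prints: Example Application
```
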
### Machine-readable data
Libraries producing MessageValues as error messages should generally produce
DataMessageValues instead. Codes should be similar to message keys but need
not be prefixed. Data should be restricted to values that will produce valid
output when passed to `json_encode()`.
Libraries producing MessageValues in other contexts should consider whether the
same applies to those contexts.
---
[jQuery.i18n]: https://github.com/wikimedia/jquery.i18n
[BCP 47]: https://tools.ietf.org/rfc/bcp/bcp47.txt
[CLDR]: http://cldr.unicode.org/
[CLDR plurals]: https://www.unicode.org/cldr/charts/latest/supplemental/language_plural_rules.html
[jQuery.i18n grammar]: https://github.com/wikimedia/jquery.i18n#grammar
[jQuery.i18n file format]: https://github.com/wikimedia/jquery.i18n#message-file-format
[translatewiki.net]: https://translatewiki.net/wiki/Translating:New_project
[T-V]: https://en.wikipedia.org/wiki/T%E2%80%93V_distinction