* INCOMPLETE
* XML Schema Inference Rules
** Requirements
XmlReader:
<ul>
- one that does not expose EntityReference nodes.
- one that does not contain xsd:* elements.
</ul>
XmlSchemaSet: only one that was generated by this utility class. See
the particle inference section described later.
Actually, the MS implementation does not check this input
sufficiently, so it accepts more than it expects.
*** Allowed schema components
Before inferring particles to be merged with the particles already
present in the XmlSchemaSet, we have to know what is expected and
what is not:
<ul>
- facets are not supported. [a014.xsd]
- xs:all is not supported. [a003.xsd]
- xs:group (ref) is not supported. [a004.xsd]
- xs:choice that does not contain xs:sequence is not
supported [a005.xsd].
- xs:any is not supported. Only xs:element are expected
to be contained in xs:sequence. [a011.xsd]
- same-name particles that are still not ambiguous
are computed into invalid particles. It looks
like an unintended MS bug. [a010.xsd]
- attributeGroup does not seem to be expected there (MS has
a bug around here). [a006.xsd]
- anyAttribute is not regarded as a valid particle, and
the output complexType definition just drops it.
[a013.xsd]
- but substitutionGroup is not rejected and it will remain
in the output. [a001.xsd]
-> It must be rejected. It breaks choice compatibility.
</ul>
** Processing model
First, the XmlSchemaSet parameter is compiled[*1] and interpreted into
its internal schema representation, which is then used to examine the
XmlReader input. The resulting XmlSchemaSet is the same
as the input XmlSchemaSet.
[*1] FIXME: this design might change.
The XmlSchemaSet is compiled because it might contain
XmlSchemaInclude items, so it won't be possible to process inference
inside the input schema set. However, reusing the input reduces
some annoyance, such as having to preserve elementFormDefault etc.
Second, XmlReader is moved to content (document element) and
"element inference" starts from here (described later).
The resulting XmlSchemaSet keeps the original XmlSchemas in itself.
For example, it keeps elementFormDefault and attributeFormDefault.
Basically it will process the XmlReader with the existing XmlSchemaSet
and won't "merge" two XmlSchemaSets, one of which is newly inferred
from this XmlReader, because the XmlReader will have to
infer sequential nodes (siblings) anyway.
Once the element definition is determined (or created), any other
branches in the schema are ignored.
** Attributes
*** attribute component definitions and references.
**** ignored attributes
xsi:type, xsi:schemaLocation and xsi:noNamespaceSchemaLocation
attributes are ignored.
**** special attributes
If xsi:nil exists, then the element's content is not handled, while
its attributes are handled.
The schema for xml:* attributes is predetermined; there is a fixed
schema for that ns.
**** namespaced attributes
Miscellaneous attributes that reside in a certain namespace are
referenced as <attribute ref="qualified-name" />
**** local attributes
Other miscellaneous attributes are represented as <attribute name="blah" />
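The attribute cases above can be summarised as a small classifier. This is only an illustrative sketch in Python, not the implementation language; the function name `classify_attribute` and the labels it returns are hypothetical, and treating unlisted xsi:* attributes like any other namespaced attribute is an assumption.

```python
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
XML_NS = "http://www.w3.org/XML/1998/namespace"
IGNORED_XSI = {"type", "schemaLocation", "noNamespaceSchemaLocation"}


def classify_attribute(namespace: str, local_name: str) -> str:
    """Return a (hypothetical) label describing how an attribute is treated."""
    if namespace == XSI_NS:
        if local_name in IGNORED_XSI:
            return "ignored"      # never appears in the inferred schema
        if local_name == "nil":
            return "nil"          # element content is not handled
    if namespace == XML_NS:
        return "predefined"       # xml:* uses a fixed schema
    if namespace:
        return "ref"              # <attribute ref="qualified-name" />
    return "name"                 # <attribute name="blah" />
```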
*** attribute occurrence
When defining a complexType for a newly-created element, the attribute
can be set as "required". Otherwise, it must be set as "optional".
For every occurrence of an element instance, all known attributes are
tested for existence, and if one is missing, then it must be set as
"optional".
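As a rough illustration of the occurrence rule, the following Python sketch tracks the "use" of each attribute across element instances. The function name `merge_attribute_use` and the dictionary representation are hypothetical and only serve the example.

```python
def merge_attribute_use(known, seen, is_new_element):
    """Update attribute uses for one element instance.

    known: dict mapping attribute name -> "required" | "optional"
    seen: set of attribute names present on this instance
    is_new_element: True when the element definition was just created
    """
    # Attributes seen for the first time: required only on a new definition.
    for name in seen:
        if name not in known:
            known[name] = "required" if is_new_element else "optional"
    # Attributes known from earlier instances but missing here become optional.
    for name in known:
        if name not in seen:
            known[name] = "optional"
    return known
```

For example, an element first seen with `id` and `lang`, and later with `id` and `title`, ends up with `id` required and both `lang` and `title` optional.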
*** attribute value types
FIXME: need to describe the relaxation of attribute value types.
** Content model inference
*** inference processing model
A content model consists of two parts:
- content type : empty | elementOnly | textOnly | mixed
- particle : sequence | choice | all | groupRef
On processing reader.Read(), the node is first "tested" against
current schema content model. If the current node on the XmlReader
is not acceptable, then "content model expansion" happens.
<ul>
- If the current node is text content, then process the
text node according to "evaluating text content".
- If the current node is an element, then process it
in accordance with "evaluating particle".
</ul>
*** evaluating element
When an element occurs, it must be accepted as a particle.
First, content type must be examined:
<ul>
- If the content type was simpleType, then it is changed
into complexType with complexContent and mixed='true'.
The inferred content particle must be optional.
- If the content type was empty, then it is changed into
complexType with complexContent (it is not mixed unlike
above). The inferred content particle must be optional.
- If the content type was elementOnly or mixed, no need
to change.
</ul>
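The content type changes listed above can be written down as a small lookup. This is a sketch only: the function name is hypothetical, and reading "complexType with complexContent, not mixed" as elementOnly is an interpretation of the notes rather than something they state.

```python
def content_type_after_element(content_type):
    """Return (new_content_type, particle_must_be_optional) when an
    element is found inside content of the given type."""
    transitions = {
        "simpleType": "mixed",         # complexContent with mixed='true'
        "empty": "elementOnly",        # complexContent, not mixed
        "elementOnly": "elementOnly",  # no need to change
        "mixed": "mixed",              # no need to change
    }
    new_type = transitions[content_type]
    # Only the two converting cases force the inferred particle to be optional.
    particle_must_be_optional = content_type in ("simpleType", "empty")
    return new_type, particle_must_be_optional
```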
Next, the content particle must be evaluated.
According to the input XmlSchemaSet limitations, there will be
only these patterns listed here:
- empty content
- simple content
- sequence (of element particles)
- choice of sequences
**** Reader progress
Every element is tested against current element candidates.
<ul>
- When the target element is a document element, then all
the global elements in XmlSchemaSet are the candidates.
<ul>
- If there is a matching name, then that element
definition is used as the context element for
the node's content, and the current particle is
positioned in front of the first particle.
- If there isn't, then the inference engine creates
a new element definition, and content is none
(none != empty).
</ul>
- When the target element is inferred in a new element
definition, then
</ul>
**** Particle inference
IMPORTANT: Here I tried to formalize the inference, but these are
incomplete notes.
Target {particle} to add:
    isNew  -> <xs:element name={name}> ... </xs:element>
    !isNew -> <xs:element name={name} minOccurs="0"> ... </xs:element>
no definition
    // define complexType and add {particle} to .Particle
    toComplexType()
    processContent(ct.Particle, isNew)
simpleType
    makeComplexContent()
complexType
    empty definition (no content model, no particle)
        // -> add xs:element name={name} minOccurs="0" to .Particle
        -> processContent(ct.Particle, isNew)
    simple content
        -> makeComplexContent()
    complex content / extension
        -> processContent(cce.Particle, isNew)
    complex content / restriction
        -> processContent(ccr.Particle, isNew)
    .Particle
        -> processContent(ct.Particle, isNew)
makeComplexContent()
    change to complexType which has complex content mixed="true" and
    extension. Discard simple type information. Add {particle} to
    extension's .Particle.
processContent(Particle particle, isNew)
    if particle is either empty or sequence
        processSequential(particle, 0, false, isNew)
    else if particle is sequence of choices
        processLax(particle, 0)
    else
        error.
processSequential(Sequence particle, int index, bool consumed, bool isNew)
    particle.Count <= index
        -> appendSequential(particle, isNew)
    sequence
        if (particle[index] has the same name)
            -> if (consumed) then sequence[index].maxOccurs = inf.
               InferElement (sequence[index])
               processParticles(particle, index, true)
        else
            -> if (!consumed)
                   sequence[index].minOccurs = 0.
               processParticle(particle, index+1, false)
    else
        particle = toSequenceOfChoice(particle)
        processLax(particle, index)
processLax(choice, index)
    foreach (element el in choice.Items)
        if (el has the same name)
            InferElement (el)
            processLax(choice, index + 1)
            return;
    appendLax(particle)
appendSequential(particle)
    if (particle is empty)
        make particle a sequence
    sequence.Items.Add(InferElement(null))
appendLax(choice)
    choice.Items.Add(InferElement(null))
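To make the sequential branch of the notes above more concrete, here is a minimal Python sketch of one possible reading of processSequential. It is not the actual implementation: `ElementParticle`, `process_sequential` and `UNBOUNDED` are hypothetical names, child content is reduced to a list of element names, the lax (choice) fallback is left out entirely, and making never-reached trailing particles optional is an added assumption that the notes do not spell out.

```python
from dataclasses import dataclass
from typing import List

UNBOUNDED = float("inf")


@dataclass
class ElementParticle:
    name: str
    min_occurs: int = 1
    max_occurs: float = 1


def process_sequential(sequence: List[ElementParticle],
                       names: List[str], is_new: bool) -> None:
    """Merge one element instance's child names into an existing sequence."""
    index = 0
    consumed = False  # True once sequence[index] matched in this instance
    for name in names:
        # Skip non-matching particles; an unconsumed one becomes optional.
        while index < len(sequence) and sequence[index].name != name:
            if not consumed:
                sequence[index].min_occurs = 0
            index += 1
            consumed = False
        if index == len(sequence):
            # Ran off the end: append (optional unless the definition is new).
            sequence.append(ElementParticle(name, 1 if is_new else 0))
        elif consumed:
            # Same particle matched again: it repeats.
            sequence[index].max_occurs = UNBOUNDED
        consumed = True
    # Assumption: particles never reached by this instance are optional.
    start = index + 1 if consumed else index
    for particle in sequence[start:]:
        particle.min_occurs = 0
```

With this sketch, children that arrive out of order simply get appended again under the same name, which is exactly the situation where the notes switch to the lax form instead.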
*** evaluating text content
When text content occurs, it must be accepted as simple content.
<ul>
- If the content type was textOnly, then "type relaxation"
happens (described later).
- If the content type was already mixed, then it is skipped.
- If the content type was elementOnly, then the content type
becomes mixed and then skipped.
- If the content type was empty, then its content type
becomes text and then skipped. The type is xs:string (no
type promotion will happen since an empty value cannot be
accepted as any of the other types handled in this design).
</ul>
(Actually, inference is done from the pre-compilation schema
information.)
Note that type relaxation happens only when the content is inferred
as textOnly, and in that case it always occurs.
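The text-content rules can be sketched the same way as a small function. The name `content_type_after_text` is hypothetical, and the second return value simply records whether type relaxation is triggered.

```python
def content_type_after_text(content_type):
    """Return (new_content_type, type_relaxation_happens) when a text
    node is found inside content of the given type."""
    if content_type == "textOnly":
        return "textOnly", True    # existing simple type is relaxed
    if content_type == "mixed":
        return "mixed", False      # text is simply skipped
    if content_type == "elementOnly":
        return "mixed", False      # becomes mixed, then skipped
    if content_type == "empty":
        return "textOnly", False   # becomes text typed as xs:string
    raise ValueError("unknown content type: " + content_type)
```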
** Type inference
All data types are inferred from string value; either element content
or attribute value.
*** primitive type inference
When a string is being evaluated as xs:blahblah typed value, it is
tried against several types.
<ul>
- First, it is evaluated as xs:boolean; true, false<del>, 1 or 0</del>.
- Next, its integer value is computed. If that is
successful, then its value range is examined to see whether it
matches unsignedByte, byte, unsignedShort, short,
unsignedInt, int, unsignedLong, long, and integer.
- If it was not an integer, then it is evaluated as a float
number, as a double number, and then as a decimal number
as well.
- Next, it is examined as xs:dateTime, xs:duration and
related schema types.
- If it did not match any of the predefined types, then
xs:string is inferred. No other string-based types (such
as xs:token) are inferred.
</ul>
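The first two steps of the list above (boolean, then integer range examination) can be sketched as follows. This Python fragment is illustrative only: the function name is hypothetical, the floating point, decimal and date/time steps are deliberately omitted (the function returns None when they would be needed), and accepting surrounding whitespace is an assumption.

```python
import re
from typing import Optional

# Candidate types in the order given in the list above, with value ranges.
INTEGER_TYPES = [
    ("xs:unsignedByte", 0, 2**8 - 1),
    ("xs:byte", -2**7, 2**7 - 1),
    ("xs:unsignedShort", 0, 2**16 - 1),
    ("xs:short", -2**15, 2**15 - 1),
    ("xs:unsignedInt", 0, 2**32 - 1),
    ("xs:int", -2**31, 2**31 - 1),
    ("xs:unsignedLong", 0, 2**64 - 1),
    ("xs:long", -2**63, 2**63 - 1),
]
_INTEGER = re.compile(r"[+-]?[0-9]+")


def infer_boolean_or_integer(text: str) -> Optional[str]:
    """Return the inferred type name, or None if later steps must decide."""
    value = text.strip()
    if value in ("true", "false"):   # "1" and "0" are not treated as boolean
        return "xs:boolean"
    if _INTEGER.fullmatch(value):
        number = int(value)
        for name, low, high in INTEGER_TYPES:
            if low <= number <= high:
                return name          # first type whose range fits
        return "xs:integer"          # too large for any bounded type
    return None
```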
*** type relaxation
When a string value is to be accepted by an existing type, the type
might have to change to accept it.
For example:
<ul>
- xs:int cannot accept "abc"
- <del>string with maxLength="3" cannot accept "abcd"</del>
facets are not created anyways and thus not supported
by this inference engine.
- 12345 is not acceptable for xs:unsignedByte, but acceptable
for unsignedShort
</ul>
Here, the new string value is inferred into a simpleType, and then
the processor will compute the most specific common type between
the existing type and the newly inferred type.
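A minimal sketch of such a relaxation, restricted to the integer types, could look like this in Python. It assumes that "most specific common type" can be approximated by the first listed type whose value range covers both the existing type's range and the new value; that choice, the function name `relax_integer_type`, and the fallback to xs:string for everything else are assumptions made for the example, not a description of the actual rules.

```python
import re

INTEGER_TYPES = [
    ("xs:unsignedByte", 0, 2**8 - 1),
    ("xs:byte", -2**7, 2**7 - 1),
    ("xs:unsignedShort", 0, 2**16 - 1),
    ("xs:short", -2**15, 2**15 - 1),
    ("xs:unsignedInt", 0, 2**32 - 1),
    ("xs:int", -2**31, 2**31 - 1),
    ("xs:unsignedLong", 0, 2**64 - 1),
    ("xs:long", -2**63, 2**63 - 1),
]
RANGES = {name: (low, high) for name, low, high in INTEGER_TYPES}
_INTEGER = re.compile(r"[+-]?[0-9]+")


def relax_integer_type(existing: str, text: str) -> str:
    """Return a type that accepts both the existing type and the new value."""
    value = text.strip()
    is_integer = _INTEGER.fullmatch(value) is not None
    if existing == "xs:integer" and is_integer:
        return "xs:integer"
    if existing not in RANGES or not is_integer:
        return "xs:string"           # simplification: no other promotions
    low, high = RANGES[existing]
    number = int(value)
    low, high = min(low, number), max(high, number)
    for name, type_low, type_high in INTEGER_TYPES:
        if type_low <= low and high <= type_high:
            return name              # first type covering the merged range
    return "xs:integer"
```

Under these assumptions the examples above behave as described: "abc" turns xs:int into xs:string, and 12345 turns xs:unsignedByte into xs:unsignedShort.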