mcs/class/System.XML/System.Xml.Schema/XmlSchemaInferenceDesign.txt



* INCOMPLETE

* XML Schema Inference Rules

** Requirements

	XmlReader:
	<ul>
		- that does not expose EntityReference.
		- that does not contain xsd:* elements.
	</ul>

	XmlSchemaSet: only one that was generated by this utility class. See
	the particle inference section described later.

	Actually, the MS implementation does not check this input
	sufficiently, so it accepts more than it is supposed to.

*** Allowed schema components

	Before inferring merged particles with premised particles in
	XmlSchemaSet, we have to know what is expected and what is not:

	<ul>
		- facets are not supported. [a014.xsd]
		- xs:all is not supported. [a003.xsd]
		- xs:group (ref) is not supported. [a004.xsd]
		- xs:choice that does not contain xs:sequence is not
		  supported [a005.xsd].
		- xs:any is not supported. Only xs:element are expected
		  to be contained in xs:sequence. [a011.xsd]
		- same-name particles that are still not ambiguous
		  are computed into invalid particles. It looks
		  like an unintended MS bug. [a010.xsd]
		- attributeGroup does not seem to be expected there (MS has
		  a bug around here). [a006.xsd]
		- anyAttribute is not regarded as a valid particle, and
		  the output complexType definition just strips it out.
		  [a013.xsd]
		- but substitutionGroup is not rejected and it will remain
		  in the output. [a001.xsd]
		  -> It must be rejected. It breaks choice compatibility.
	</ul>
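
	The list above can be turned into a pre-check on an input schema.
	The following is only a sketch under assumptions: find_unsupported
	and the message strings are hypothetical names, the facet list is
	the standard XSD constraining facets, and "xs:choice that does not
	contain xs:sequence" is read literally as "no xs:sequence child".
	It is not how the MS or Mono implementation checks its input.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Constraining facets defined by XML Schema Part 2.
FACETS = {"length", "minLength", "maxLength", "pattern", "enumeration",
          "whiteSpace", "maxInclusive", "maxExclusive", "minInclusive",
          "minExclusive", "totalDigits", "fractionDigits"}


def find_unsupported(xsd_text):
    """Return a sorted list of components that fall outside the expected input."""
    problems = set()
    for node in ET.fromstring(xsd_text).iter():
        if not node.tag.startswith(XS):
            continue
        local = node.tag[len(XS):]
        if local in FACETS:
            problems.add("facet: " + local)
        elif local in ("all", "any", "attributeGroup"):
            problems.add("xs:" + local)
        elif local == "group" and "ref" in node.attrib:
            problems.add("xs:group ref")
        elif local == "choice":
            # Literal reading: a choice needs at least one xs:sequence child.
            if not any(child.tag == XS + "sequence" for child in node):
                problems.add("xs:choice without xs:sequence")
        elif local == "element" and "substitutionGroup" in node.attrib:
            problems.add("substitutionGroup")
    return sorted(problems)
```

	anyAttribute is deliberately not reported here, since the notes say
	it is stripped from the output rather than rejected.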


** Processing model

	First, parameter XmlSchemaSet is compiled[*1] and interpreted into
	its internal schema representation that is going to be used for
	XmlReader input examination. The resulting XmlSchemaSet is the same
	as the input XmlSchemaSet.

	[*1] FIXME: this design might change.
	The XmlSchemaSet is compiled and interpreted because it might
	contain XmlSchemaInclude items, so it won't be possible to process
	inference inside the input schema set. However, reusing the input
	reduces some annoyance, for example when preserving
	elementFormDefault etc.

	Second, XmlReader is moved to content (document element) and
	"element inference" starts from here (described later).

	The resulting XmlSchemaSet keeps the original XmlSchemas in itself.
	For example, it keeps elementFormDefault and attributeFormDefault.

	Basically it will process the XmlReader with the existing
	XmlSchemaSet and won't "merge" two XmlSchemaSets, one of which is
	newly inferred from this XmlReader, because the XmlReader will
	have to infer sequential nodes (siblings) anyway.

	Once the element definition is determined (or created), any other
	branches in the schema are ignored.


** Attributes

*** attribute component definitions and references.

**** ignored attributes

	xsi:type, xsi:schemaLocation and xsi:noNamespaceSchemaLocation
	attributes are ignored.

**** special attributes

	If xsi:nil exists, then the element's content is not handled, while
	its attributes are handled.

	The xml:* attributes are predetermined; there is a fixed schema for
	that namespace.

**** namespaced attributes

	miscellaneous attributes that reside in a certain namespace are
	referenced as <attribute ref="qualified-name" />

**** local attributes

	miscellaneous attributes are represented as <attribute name="blah" />
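
	The four attribute cases above amount to a small dispatch on the
	attribute's namespace and local name. A minimal sketch, assuming a
	hypothetical classify_attribute helper and made-up result labels
	(the notes do not name any such function):

```python
XSI = "http://www.w3.org/2001/XMLSchema-instance"
XML = "http://www.w3.org/XML/1998/namespace"
IGNORED_XSI = {"type", "schemaLocation", "noNamespaceSchemaLocation"}


def classify_attribute(namespace, local_name):
    """Map an attribute to one of the handling cases described in the notes."""
    if namespace == XSI and local_name in IGNORED_XSI:
        return "ignored"
    if namespace == XSI and local_name == "nil":
        return "nil"            # element content is not handled
    if namespace == XML:
        return "predefined"     # fixed schema for the xml: namespace
    if namespace:
        return "ref"            # <attribute ref="qualified-name" />
    return "name"               # <attribute name="blah" />
```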


*** attribute occurrence

	when defining a complexType for a newly-created element, the attribute
	can be set as "required". Otherwise, it must be set as "optional".

	For every element instance occurrence, all attributes are tested
	for existence, and if one does not exist, then it must be set as
	"optional".
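
	As an illustration of the rule above, the sketch below folds a
	series of element instances into a required/optional table. The
	function name and the set-per-instance input shape are assumptions
	made for the example only.

```python
def infer_attribute_use(instances):
    """instances: one set of attribute names per occurrence of the element."""
    use = {}
    for index, present in enumerate(instances):
        for name in present:
            if name not in use:
                # Only the occurrence that creates the complexType may
                # introduce an attribute as "required".
                use[name] = "required" if index == 0 else "optional"
        for name in use:
            if name not in present:
                use[name] = "optional"
    return use
```

	For example, infer_attribute_use([{"id", "lang"}, {"id"}]) keeps
	"id" required and demotes "lang" to optional.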

*** attribute value types

	FIXME: need to describe the relaxation of attribute value types.


** Content model inference

*** inference processing model

	The content model consists of two parts:

		- content type : empty | elementOnly | textOnly | mixed
		- particle : sequence | choice | all | groupRef

	On processing reader.Read(), the node is first "tested" against
	current schema content model. If the current node on the XmlReader
	is not acceptable, then "content model expansion" happens.

	<ul>
		- If the current node is text content, then process the
		  text node according to "evaluating text content".
		- If the current node is an element, then process it
		  in accordance with "evaluating particle".
	</ul>


*** evaluating element

	When an element occurs, it must be accepted as a particle.
	First, content type must be examined:

	<ul>
		- If the content type was simpleType, then it is changed
		  into complexType with complexContent and mixed='true'.
		  The inferred content particle must be optional.
		- If the content type was empty, then it is changed into
		  complexType with complexContent (it is not mixed unlike
		  above). The inferred content particle must be optional.
		- If the content type was elementOnly or mixed, no need
		  to change.
	</ul>
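
	The content type change on an element can be written as a small
	transition function. This is a sketch only: on_element is a
	hypothetical name, and the content type labels are the ones used in
	"inference processing model" above (textOnly stands for simpleType).

```python
def on_element(content_type):
    """Return (new content type, particle forced to be optional)."""
    if content_type == "textOnly":
        # simpleType becomes complexType with complexContent, mixed='true'.
        return ("mixed", True)
    if content_type == "empty":
        # complexContent, but not mixed.
        return ("elementOnly", True)
    if content_type in ("elementOnly", "mixed"):
        # No change; optionality is left to particle evaluation.
        return (content_type, False)
    raise ValueError("unknown content type: " + content_type)
```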

	Next, the content particle must be evaluated. 

	According to the input XmlSchemaSet limitations, there will be
	only these patterns listed here:

		- empty content

		- simple content

		- sequence (of element particles)

		- choice of sequences

**** Reader progress

	Every element is tested against current element candidates.

	<ul>
		- When the target element is a document element, then all
		  the global elements in XmlSchemaSet are the candidates.

		<ul>
			- If there is a matching name, then that element
			  definition is used as the context element for
			  the node's content, and current particle is
			  in front of the first particle.
		 	- If there isn't, then the inference engine creates
			  a new element definition, and content is none
			  (none != empty).
		</ul>

		- When the target element is inferred in a new element
		  definition, then 
	</ul>


**** Particle inference

	IMPORTANT: Here I tried to formalize the inference, but these are
	incomplete notes.

	Target {particle} to add:
		isNew  -> <xs:element name={name}> ... </xs:element>
		!isNew -> <xs:element name={name} minOccurs="0"> ... </xs:element>

	no definition
	//	define complexType and add {particle} to .Particle
		toComplexType()
		processContent(ct.Particle, isNew)

	simpleType
		makeComplexContent()

	complexType
		empty definition (no content model, no particle)
	//		-> add xs:element name={name} minOccurs="0" to .Particle
			-> processContent(ct.Particle, isNew)

		simple content
			-> makeComplexContent()

		complex content / extension
			-> processContent(cce.Particle, isNew)

		complex content / restriction
			-> processContent(ccr.Particle, isNew)

		.Particle
			-> processContent(ct.Particle, isNew)

	makeComplexContent()
		change to complexType which has complex content mixed="true" and
		extension. Discard simple type information. Add {particle} to
		extension's .Particle.

	processContent(Particle particle, isNew)
		if particle is either empty or sequence
			processSequential(particle, 0, false, isNew)
		else if particle is sequence of choices
			processLax(particle, 0)
		else
			error.

	processSequential(Sequence particle, int index, bool consumed, bool isNew)
		particle.Count <= index
			-> appendSequential(particle, isNew)
		sequence
			if (particle[index] has the same name)
			     -> if (consumed) then sequence[index].maxOccurs = inf.
				InferElement (sequence[index])
				processSequential(particle, index, true, isNew)
			else
			     -> if (!consumed)
					sequence[index].minOccurs = 0.
					processSequential(particle, index+1, false, isNew)
				else
					particle = toSequenceOfChoice(particle)
					processLax(particle, index)

	processLax(choice, index)
		foreach (element el in choice.Items)
			if (el has the same name)
				InferElement (el)
				processLax(choice, index + 1)
				return;
		appendLax(choice)

	appendSequential(particle)
		if (particle is empty)
			make particle as sequence
		sequence.Items.Add(InferElement(null))

	appendLax(choice)
		choice.Items.Add(InferElement(null))
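
	Since the notes above are incomplete, the following is not a
	transcription of them but one possible reading of the sequential
	case, written out so that it runs. Everything here is an
	assumption: the infer_children name, the dict-based particle
	model, the rule that an element seen earlier in the sequence
	forces the lax (choice) form, and the rule that particles absent
	from an instance become minOccurs="0".

```python
UNBOUNDED = "unbounded"


def infer_children(particles, names, is_new):
    """Fit one element instance's child names into an existing sequence.

    particles: list of {"name", "min", "max"} dicts, mutated in place.
    names: child element names in document order.
    is_new: True when the element definition was just created.
    Returns "sequence", or "choice" when the order cannot be kept (the
    particles may already be partly updated in that case).
    """
    index, consumed = 0, False
    for name in names:
        if index < len(particles) and particles[index]["name"] == name:
            if consumed:
                particles[index]["max"] = UNBOUNDED  # repeated sibling
            consumed = True
            continue
        start = index + 1 if consumed else index
        match = next((i for i in range(start, len(particles))
                      if particles[i]["name"] == name), None)
        if match is None and any(p["name"] == name for p in particles):
            return "choice"  # already passed this name: fall back to lax
        stop = len(particles) if match is None else match
        for skipped in particles[start:stop]:
            skipped["min"] = 0  # passed over without being matched
        if match is None:
            particles.append(
                {"name": name, "min": 1 if is_new else 0, "max": 1})
            match = len(particles) - 1
        index, consumed = match, True
    for trailing in particles[index + 1 if consumed else index:]:
        trailing["min"] = 0  # absent from this instance
    return "sequence"
```

	Calling it with an empty list, the names a, a, b and is_new set
	yields a sequence of a (maxOccurs unbounded) followed by b; a later
	instance containing only a and c makes b optional and appends an
	optional c.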


*** evaluating text content

	When text content occurs, it must be accepted as simple content.

	<ul>
		- If the content type was textOnly, then "type relaxation"
		  happens (described later).
		- If the content type was already mixed, then it is skipped.
		- If the content type was elementOnly, then the content type
		  becomes mixed and the text is then skipped.
		- If the content type was empty, then its content type
		  becomes textOnly and the text is then skipped. The type is
		  xs:string (no type promotion will happen since an empty
		  value cannot be accepted as any of the other types handled
		  in this design).
	</ul>

	(Actually, inference is done from pre-compilation information, not
	from post-compilation information.)

	Note that type relaxation happens only when the content is inferred
	as textOnly, and in that case it always occurs.
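
	The four text cases can be summarised as a lookup table. A minimal
	sketch, where on_text and the action labels are hypothetical names
	chosen for this example:

```python
# content type before -> (content type after, what to do with the text)
TEXT_TRANSITIONS = {
    "textOnly": ("textOnly", "relax-type"),
    "mixed": ("mixed", "skip"),
    "elementOnly": ("mixed", "skip"),
    "empty": ("textOnly", "use-xs:string"),
}


def on_text(content_type):
    """Return (new content type, action) for a text node."""
    return TEXT_TRANSITIONS[content_type]
```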


** Type inference

	All data types are inferred from string value; either element content
	or attribute value.


*** primitive type inference

	When a string is being evaluated as xs:blahblah typed value, it is
	tried against several types.

	<ul>
		- First, it is evaluated as xs:boolean; true, false<del>, 1 or 0</del>.

		- Next, its integer value is computed. If this is
		  successful, then its value range is examined to see which
		  of unsignedByte, byte, unsignedShort, short, unsignedInt,
		  int, unsignedLong, long, and integer it matches.

		- If it was not an integer, then it is evaluated as a float
		  number, as a double number, and then as a decimal number
		  as well.

		- Next, it is examined as xs:dateTime, xs:duration and
		  related schema types.

		- If it did not match any kind of predefined types, then
		  xs:string is inferred. No other string-based types (such
		  as xs:token) are inferred.
	</ul>
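
	The order of tests above can be sketched as follows. This is a
	simplified illustration, not the actual implementation:
	infer_primitive is a hypothetical name, the float/double/decimal
	step is collapsed into a single "xs:float" result without range
	checks, and xs:dateTime and xs:duration are recognised by rough
	lexical patterns only (other related date types are omitted).

```python
import re

# Narrowest type first, in the order given in the notes.
INTEGER_LADDER = [
    ("xs:unsignedByte", 0, 2**8 - 1),
    ("xs:byte", -2**7, 2**7 - 1),
    ("xs:unsignedShort", 0, 2**16 - 1),
    ("xs:short", -2**15, 2**15 - 1),
    ("xs:unsignedInt", 0, 2**32 - 1),
    ("xs:int", -2**31, 2**31 - 1),
    ("xs:unsignedLong", 0, 2**64 - 1),
    ("xs:long", -2**63, 2**63 - 1),
]

_INT = re.compile(r"[+-]?[0-9]+")
_NUM = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?")
_DATETIME = re.compile(
    r"-?[0-9]{4,}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}"
    r"(\.[0-9]+)?(Z|[+-][0-9]{2}:[0-9]{2})?")
_DURATION = re.compile(
    r"-?P(?=.)([0-9]+Y)?([0-9]+M)?([0-9]+D)?"
    r"(T(?=.)([0-9]+H)?([0-9]+M)?([0-9]+(\.[0-9]+)?S)?)?")


def infer_primitive(text):
    """Return the first type, in the documented order, that accepts text."""
    if text in ("true", "false"):      # "1" and "0" are not treated as boolean
        return "xs:boolean"
    if _INT.fullmatch(text):
        value = int(text)
        for name, low, high in INTEGER_LADDER:
            if low <= value <= high:
                return name
        return "xs:integer"
    if _NUM.fullmatch(text):
        return "xs:float"              # stand-in for float/double/decimal
    if _DATETIME.fullmatch(text):
        return "xs:dateTime"
    if _DURATION.fullmatch(text):
        return "xs:duration"
    return "xs:string"
```

	Note that the empty string falls through to xs:string, which agrees
	with the remark in "evaluating text content" that an empty value
	cannot be accepted as any other type.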


*** type relaxation

	When a string value is being accepted with existing type, the type
	might have to change to accept it.
	
	For example:
	<ul>
		- xs:int cannot accept "abc"
		- <del>string with maxLength="3" cannot accept "abcd"</del>
		  facets are never created and thus are not supported
		  by this inference engine.
		- 12345 is not acceptable for xs:unsignedByte, but acceptable
		  for unsignedShort
	</ul>

	Here, the new string value is inferred into a simpleType, and then
	the processor will compute the most specific common type between
	the existing type and the newly inferred type.
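
	The notes do not define how the "most specific common type" is
	computed, so the sketch below is only one plausible reading. It
	assumes (hypothetically) that integer types are compared by value
	range, that the narrowest integer type covering both ranges wins,
	and that every other combination of different types relaxes
	straight to xs:string. common_type is an illustrative name.

```python
# Narrowest type first; the same order as in "primitive type inference".
INTEGER_LADDER = [
    ("xs:unsignedByte", 0, 2**8 - 1),
    ("xs:byte", -2**7, 2**7 - 1),
    ("xs:unsignedShort", 0, 2**16 - 1),
    ("xs:short", -2**15, 2**15 - 1),
    ("xs:unsignedInt", 0, 2**32 - 1),
    ("xs:int", -2**31, 2**31 - 1),
    ("xs:unsignedLong", 0, 2**64 - 1),
    ("xs:long", -2**63, 2**63 - 1),
]


def common_type(existing, inferred):
    """Return a type that accepts values of both input types."""
    if existing == inferred:
        return existing
    ranges = {name: (low, high) for name, low, high in INTEGER_LADDER}
    integer_family = set(ranges) | {"xs:integer"}
    if existing in integer_family and inferred in integer_family:
        if "xs:integer" in (existing, inferred):
            return "xs:integer"
        low = min(ranges[existing][0], ranges[inferred][0])
        high = max(ranges[existing][1], ranges[inferred][1])
        for name, type_low, type_high in INTEGER_LADDER:
            if type_low <= low and high <= type_high:
                return name
        return "xs:integer"
    # Assumption: unrelated types have no common type narrower than string.
    return "xs:string"
```

	With this reading, an existing xs:unsignedByte that meets "12345"
	(inferred as xs:unsignedShort) relaxes to xs:unsignedShort, and an
	existing xs:int that meets "abc" relaxes to xs:string.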