52 * block size 64 bytes, L1d 8way, L2 8way, L3 16way shared for cores (What if every core works on separate memory chunk,
53 * 16 way can be limitting factor?)
54 * Considering about 30 scalar fields, 128 points per field and 8 bytes per double the field cache is about 30kB,
55 * that fits (with other data as Dofs, FEValues) well within L2 cache, even if we consider using just half of
56 * it to allow prefetching. Conclusion: size of the field cache that enables amortization of formula evaluation
57 * is about the size suitable for the CPU cache optimization of the assembly process.
58 *
59 * We need to join evaluation of FieldFormula for individual integrals so there will be more complex logic in the selection of field
60 * algorithm since only subset of the fields is involved in every integral.
61 *
62 * Gauss quadratures imply no shared quadrature points between bulk and face quadratures. There can be some reuse for boundary and
63 * coupling integrals, but these are relatively rare.
64 *
65 * Face integrals should deal with various side permutations.
66 *
67 * We should use existing field variables to referencing their cached values.
68 *
69 * Although we will have a common set of local evaluation points on the reference element we may allow to use only subset of them on
70 * the actual patch of elements.
71 *
72 * The mapping of the pair (element, evaluation point) to the active point on the current patch must be same for all fields invloved
73 * in the assembly. We need a class for this synchronisation.
74 *
75 * Ideas to exploit SIMD:
76 * - Once the fields are precomputed the assmbly of local matrix/vector can use SIMD operations as the forms are evaluated uing a same
77 * formula for every (i-shape, j-shape, q-point).
78 * - We can possibly even assembly more elements at once. Considering the local matrices/vectors be placed in a continuous block of memory.
79 * - Possible problem is that as we iterate over the shape functions the field values remains constant so we may need to load the field values
80 * into constant small vectors.
81 * - FieldFormula - in order to use SIMD we need to organize the patch q-points into the SIMD blocks (4-AVX2, 8-AVX512, seems relevant also
82 * on GPUs). First atempt will just put elements of the same region into a continuous block, further we may modify patches to optimize number
83 * of the regions per patch.
84 * - FieldValues - the vectors and tensors has to be transformed be a mapping matrix. One matrix-vector multiplication takes 9 products and 6
85 * additions these operations should be vectorized over several vectors. Transformation matrices are usually constant on the element so we
86 * need to repeat them.
87 * - FieldFE - see FieldValues, further value in a point is linear combination of the same number of the shape functions (for the same dimension
88 * and order) Vectorisation over points of the elements of the same dimension and order is possible.
89 *
90 *
91 * <h2>Caching structures</h2>
92 *
93 * <b>EvalPoints</b>
94 * Summary all quadrature points on the reference elements. Necessary for the fields precalculation.
95 * - Internaly single table per dimension, but we need to distribute quadrature points accross dimension so there has to be one shared object.
96 * However quadratures are added per dimension in the Assembly classes.
97 * - Every reference element (dimension) have own list of local points. Only resulting 'Integral' objects know ranges of their points.
98 *
99 * <b>Integrals</b>
100 * We have four types of "elementatry integration domains":
101 * - BulkIntegral - on Cell
102 * - EdgeIntegral - on edge.side_range
103 * - CouplingIntegral - on bulk.ngh_range
104 * - BoundaryIntegral - on boundary_side_range (TBD)
105 * - In theory we can combine integrals to have at most single instance of each type per single assembly operation.
106 * - For every dimension we create them by a method of the EvalPoints, passing suitable set of quadratures.
107 * - Functions provided by Integrals:
108 * - Mark active eval points on the actual patch. Assembly algorithm iterates over "elementary integration domains" in the order of their Hilbert
109 * index. The integral has to iterate over its quadrature points. The point and distributing the points on them until we reach the number of
110 * active points per patch.
111 * - Iteration over qudrature points for given actual "elementaty integration domain" (e.g. edge integral for the CellSide)
112 * - Same method Integral::points(assembly_object) is used later on to iterate over quadrature points.
113 * - In principle there is one kind of the quadrature point accessor for every kind of the integral.
114 *
115 * Operations provided by the quadrature points:
116 * - 'field(point)' - retrieve value of a field in the point, e.g. 'pressure(bulk_point)'
117 * - 'side_point.point_on(cell_side)' - returns 'SidePoint' on the other side; allows iteration over all pairs of sides of an edge, in fact allows
118 * to form integrals mixing points of any two faces of same dimension, even non-colocated
119 * - 'coupling_side_point.lower_dim()' - returns 'BulkPoint' collocated with the side point in the dimension coupling.
120 *
121 * Possible interface to shape functions:
122 * - 'ElementFields element_fields ...' - values independent of the FEM, depends on mesh and mapping
123 * - 'FieldFE pressure_fe ...'
124 * - 'element_fields.normal(side_point)' - outer normal at a side point
125 * - 'element_fields.coord(point)' - absolute coordinates of the point
126 * - 'element_fields.JxW(point)' - Jacobian times quadrature weight of the point 'pressure_fe.base(i, point)' - value of the i-th shape function in
127 * the point
128 * - 'pressure_fe.grad(i, point)' - value of the i-th base function in the point.
129 *
130 * <b>ElementCacheMap</b>
131 * This holds mesh elements (sorted by regions) associated with the actual patch and for every pair (mesh element, eval_point) gives the active point
132 * index, these have to form continuous sequence. The points of the "assemble objects" forming the patch are added to the map. There should be no active
133 * points reuse as the active points of the two "assemble objects" of the same kind are disjoint (can change if we decide to change elemetary "assembly
134 * objects", e.g. single NGHSide instead of the range). ElementCacheMap is shared by dimensions !! It should be part of the EqData.
135 *
136 * <b>FieldValueCache</b>
137 * - One per field, holds only raw data for selected integrals.
138 * - It needs to use access to other classes for evaluation ('ElementCacheMap', 'Field' operators).
139 * - One value is a scalar, vector, or tensor. But stored through Array in a single chunk of memory. We probably use a single chunk of memory for
140 * all fields.
141 *
142 *
143 *
144 * <h2>Assemblation classes</h2>
145 *
146 * <b>AssemblyBase <dim> </b>
147 * Common virtual ancestor of all assembly classes. Defines their common methods and data members. One instance for every dimension.
148 *
149 * <b>XYZAssembly <dim> </b>
150 * - descendants of AssemblyBase implementing internals of a single assemblation process
151 * - methods:
152 * - XYZAssembly(eq_fields, eq_data) - constructor, sets shared data with equation
153 * - initialize(element_cache_map) - reinit called typicaly once per simulation, creates FEValues object, resisez vectors etc.
154 * - begin - called at beginning of the general assembly (e.g. start balance or matrix assembly)
155 * - end - called after general asssembly
156 * - cell_integral(cell, element_patch_idx) - assembles the volume integrals on cell
157 * - boundary_side_integral(cell_side) - assembles between boundary element and side of bulk element
158 * - edge_integral(edge_side_range) - assembles between sides on the edge
159 * - dimjoin_intergral(cell_lower_dim, neighb_side) - assembles between elements of different dimensions
160 *
161 * <b>GenericAssembly <...> </b>
162 * - templated by a class derived from AssemblyBase
163 * - contains the assembly machinary, wrap it to a single method 'assembly', this method:
164 * 1. Call 'reallocate' of ElementCacheMap and 'begin' of templated assembly class.
165 * 2. Iterates over cells (elements) and fills patch.
166 * 3. Call assembly methods (cell_integral etc.) of templated assembly class if patch is full.
167 * 4. Repeat points 2. and 3. until all cells have passed.
168 * 5. Call 'end' of templated assembly class.
169 *
170 *
171 *
172 * <h2>Cache operations</h2>
173 *
174 * <b>Overview</b>
175 * Usage of the field caches consists of:
176 * 1. Merging more quadratures into a single set of the local evaluation points (class `EvalPoints`).
177 * 2. Create a `FieldSet` one for every quadrature of the fields involved in that integral.
178 * 3. Initialize the fields in the integral's field set: Allocate the cache space in fields and mark which quadrature the field use. This is done
179 * through the call of `FieldSet::cache_reallocate(ElementCacheMap cache_map, FieldSet used_fields)`
180 * 4. Composing the assambly patch, element cache prefetching. This is organized by
181 * `ElementCacheMap`.
182 * 5. Assembly: Iterate over Integrals and get cached values from the fields.
194 * 2. Get local point coordinates for given index.
195 *
196 * Common storage of quadrature points is not necessary anymore, but we keep it in order to keep indexing of the points the same accross the fields
197 * (which can use ony certain subblocks of the points). We can decrease the memory footprint (and thus the cache footprint) at the expense of slower
198 * access to the cached values (one more indirect memory access).
199 *
200 * <b>EvalSubset</b>
201 * The object containing array of indices into local point set, there is only single array for the bulk quadrature, but (n_sides x n_permutations)
202 * for a side quadrature. We shall try to keep this common for the bulk and side quadratures and non-templated. It keeps pointer to the EvalPoints,
203 * in particular to know total number of local points.
256 * 'FieldSet::cache_reallocate' just adds dependent fields (eg. coordinates in FieldFormula, used fields in FieldModel) and calls the same method
257 * for every 'FieldCommon' in the set.
258 *
259 * This is a virtual method with implementation in 'Field<...>'.
260 *
261 * <i>TODO: Different assembly loops may need different evaluation subsets, so either we need mean to reallocate caches before every assembly loop
262 * or just update only selected part of the field value cache. Current structure of FieldValueCache allows to have allocated individual EvalSubset
263 * tables separately and update only selected subsets. Moreover the other subsets do not block the CPU cache (example of associativity, seems that
264 * AMD processors have 4way L1, 16way L2 and 64way L3 cache; try to get similar info for Intel) anyway we shuld mak an analysis how many memory
265 * blocks we need in any computation loop. Innermost loops should idealy access at most 4 memory blocks (4 pointers).</i>
266 *
267 * <h3>4. ElementCacheMap</h3>
268 *
269 * The map is represented by the table with dimensions [n_cache_elements][n_eval_points]. n_cache_elements is choosen as fixed (relative to number
270 * of QP in dimensions, some heuristic formula necessary), n_eval_points given before assembly. Following algorithm must avoid allocations.
271 *
272 * Filling algorithm:
273 * 1. step - Iterate over integrals collect them into a fronts (per integral type, four types). - Count total number of QP. - Stop when cache
274 * capacity is reached. - Put touched elements into an array patch_elements, pairs (region_idx, element_idx) - Drop last uncomplete integral group
275 * 2. step - sort patch_elements by region - set whole cache map table to -1 - loop over integrals (in fronts) mark needed QP by 1 - loop over whole
276 * cache map number active QPs
277 * 3. step - loop over fields in the field set - for every field make groups of regions for distiguish field algorithms - for every field algorithm
278 * call cache update on QP subset (indices of beginings of SIMD groups)
282 * - Every integral struct implements methods for steps: 1. and 2.
283 * - For step 1. and 2. the method modifies the element cache map object passed as an argument. Use specific modification methods in ElementCacheMap.
284 * - Step 3. independent of integrals.
285 * - For step 4. we pass the integral object to the assembly routine.
286 * - BulkIntegral: DHCellAccessor + subset index
287 * - EdgeIntegral: keeps range returned by cell_side.edge_sides() + subset index
290 * - When fields are changed make a topological sort of dependent fields on every region.
291 * - <i>Question: optimisation of FieldFormula evaluation over more regions is a bit complex. Need further analysis of relationships.</i>
292 *
293 * <i>DEPRECATED: This class synchronize the cached elements between (all) fields of single equation. It provides mapping from elements to the cache
294 * index and list of elements to cache. This have overloaded evaluation operator, which returns the index in the cache for the given element (or element
295 * index). The last cache line is overwritten if the index is not in the cache. The implementation use: table cache_idx -> el_idx, hash mapping
296 * el_idx -> cache_idx, list of cache lines that schould be updated.</i>
297 *
298 * @dot
299 inline void cell_integral(DHCellAccessor cell, unsigned int element_patch_idx)
300 {
301 ElementAccessor<3> elm = cell.elm();
302 fe_values_.reinit(elm);
303
304 for (auto p : this->bulk_points(element_patch_idx) )
305 {
306 ...
307 }
308
309 }
310 * @enddot
311 *
312 * <b>Cache update:</b>
313 *
314 * @dot
315 // Call cache_update for the fields in the field sets
319 // Possible update of the base function values.
320 presssure_field_fe.fe_values.update(cell);
321 * @enddot
322 *
323 * Two major algorithms are in use:
324 * - FieldFE - evaluates base func values in all quadrature points (done once per assembly), dot product with DOFs, optionaly multiplied by the Mapping
325 * matrix (important optimization for vector fields and derivatives, must have support in FEValues)
326 * - FieldFE has three instances of FEValues, one for every dimension. This is possible as the EvalPoints are structured by the dimension.
327 * - During cache update we:
328 * a. for every element in the elementcache map (range of elements in the region)
329 * b. find element dimension, call fe_values[dim].reinit(el) for that element
330 * c. update cache values in that row of the element cache map
331 * - The reinit may be slightly inefficient as all values are computed not only the active. We can try to optimize that after it is in use and we can
332 * see if it is a real problem.
333 * - FieldFormula - evaluates all elements in the patch (same region), in all point from single continuous block od quad points
405 * <b>Interface for dimension coupling integrals</b>
406 *
407 * In general the quadrature can be different then the quadrature used on faces, or the could be no integration over the faces as in the case of P1
408 * method. No problem for the lower dim element as we have to evaluate fields at all bulk points on al these elements. Only matching evaluation points
409 * on the connected side of a bulk element are necessary. Moreover we are not able to change mask of evaluation point accoring to the elements.
410 *
411 * Two possible solutions:
412 * - local point sets varies with elements
413 * - evaluation of noncached values, but still want to have the FieldValues in FieldFE for their points
414 *
415 * <i>TODO:
416 * - No sense to have more then single bulk element in bulk integrals.
417 * - In Side integrals need: normals, more then single element (elements of same or different dimension).
418 * - How to match q poiunts on side of higher dim element and bulk of lower dim elemnet. Not clear how it works in current implementation. Is the
419 * side permutation compatible with connected fracture elements?
420 * - How to cache values in quadrature points on faces of the higher dim elements efficiently? We want to precompute only values on a single side.
421 * It could be fine in case of DG if we use same quadratures for side terms as well as for the coupling temrs. But for general quadratures there
422 * could be problem. It seems the either we do not cache in this case or need more general cache.
423 * - How to apply only FeValues to FE living on edges without FeSideValues.</i>
424 *
425 * <b>Further thoughts:</b>
426 * - Extension to interdependent fields: If a field depends on the other field, it recursively informs the other field about quadrature.
427 * - We introduce fields for absolute cooridinates X, Y, Z as well as for the depth, this is related to the generalization of the FieldFormula, that