Flow123d  DF_patch_fe_data_tables-b828b90
assembly_process.h
Go to the documentation of this file.
1 /**
2  * @defgroup assembly_process Assembly Process
3  *
4  * \section classes Overview of assembly process
5  *
6  * Evaluation of Fields is performed during assemblies at different evaluation points. Assembly process allows summary
7  * evaluation of elements on one patch. The calculation is performed subsequently on all these elements.
8  *
9  *
10  * <h2>Solved problems</h2>
11  *
12  * FieldFE use FEValues in implementation of the `value` method, this involves creation of the FEValues object. This object
13  * should be created only once ideally.
14  *
15  * FieldFormula can be evaluated efficiently for the vectors of points. So we need to evaluate lot of points together. That
16  * is possible once the points are from elements of the same region.
17  *
18  * We need to achieve effective use of memory:
19  * - efficient ordering of the mesh nodes and elements
20  * - same evaluations of different field types during the assembly process
21  * - use one memory place involved during assembly on the single element
22  * Enable advantage of the vector operations.
23  *
24  *
25  * <h2>Design ideas and constraints</h2>
26  *
27  * The weak formulation is based on the evaluation of the integrals of four types:
28  * - <b>bulk integral</b> - decomposed to simplix elements
29  * - <b>edge integral</b> - decomposed to edges, edge is set of colocated faces/sides
30  * - <b>boundary integral</b> - decomposed to: boundary element of dimension D + faces of bulk elements of dimension D+1
31  * - <b>coupling integral</b> - decomposed to: bulk element of dimension D + faces of elements of dimension D+1
32  * The integration over elementary domain is done in terms of numerical quadrature, i.e. weighted average of the values
33  * in quadrature points. Single logical quadrature point may be distributed to more then one bulk element.
34  *
35  * All necessary quadrature points must be distributed to the reference elements during setup of the assmebly. FieldFE is
36  * then able to create and precompute the FEValue object for all these local points. The values on the reference element
37  * are computed once. Mapping (of vectors and tensors) to the actual element can be done only for the active points on current
38  * patch of elements.
39  *
40  * For the efficient evaluation of the FieldFormula fields we need to evaluate about 128 points at once. Efficient parsing
41  * and formula evaluation library is necessary. We use a specially developed parser for this purpose: *BParser*.
42  *
43  * In order to allow FieldFormula to depend on other fields we need that other fields are also evaluated into caches. This is
44  * a bit unfortunate for the FieldConstant and possibly cache inefficient in the case of FieldFE. The FieldFE caching is
45  * inefficient only in the case of using same FE space for some field as well as for the equation itself, that is in the case
46  * of nonlinear equations. Even in such case, we at most compute local base functions twice, once when prefetching the field
47  * value cache and second when assembling on individual elements. If the recalculation takes longer then L3 access we can cache
48  * all FEValue data for all patch quad points.
49  *
50  * Example of CPU specification: Intel Core i7: L1d 32kB, L2 256kB, L3 8MB.
51  * [I7 cache specification](https://www.edn.com/design/systems-design/4399725/Memory-Hierarchy-Design---Part-6--The-Intel-Core-i7):
52  * block size 64 bytes, L1d 8way, L2 8way, L3 16way shared for cores (What if every core works on separate memory chunk,
53  * 16 way can be limitting factor?)
54  * Considering about 30 scalar fields, 128 points per field and 8 bytes per double the field cache is about 30kB,
55  * that fits (with other data as Dofs, FEValues) well within L2 cache, even if we consider using just half of
56  * it to allow prefetching. Conclusion: size of the field cache that enables amortization of formula evaluation
57  * is about the size suitable for the CPU cache optimization of the assembly process.
58  *
59  * We need to join evaluation of FieldFormula for individual integrals so there will be more complex logic in the selection of field
60  * algorithm since only subset of the fields is involved in every integral.
61  *
62  * Gauss quadratures imply no shared quadrature points between bulk and face quadratures. There can be some reuse for boundary and
63  * coupling integrals, but these are relatively rare.
64  *
65  * Face integrals should deal with various side permutations.
66  *
67  * We should use existing field variables to referencing their cached values.
68  *
69  * Although we will have a common set of local evaluation points on the reference element we may allow to use only subset of them on
70  * the actual patch of elements.
71  *
72  * The mapping of the pair (element, evaluation point) to the active point on the current patch must be same for all fields invloved
73  * in the assembly. We need a class for this synchronisation.
74  *
75  * Ideas to exploit SIMD:
76  * - Once the fields are precomputed the assmbly of local matrix/vector can use SIMD operations as the forms are evaluated uing a same
77  * formula for every (i-shape, j-shape, q-point).
78  * - We can possibly even assembly more elements at once. Considering the local matrices/vectors be placed in a continuous block of memory.
79  * - Possible problem is that as we iterate over the shape functions the field values remains constant so we may need to load the field values
80  * into constant small vectors.
81  * - FieldFormula - in order to use SIMD we need to organize the patch q-points into the SIMD blocks (4-AVX2, 8-AVX512, seems relevant also
82  * on GPUs). First atempt will just put elements of the same region into a continuous block, further we may modify patches to optimize number
83  * of the regions per patch.
84  * - FieldValues - the vectors and tensors has to be transformed be a mapping matrix. One matrix-vector multiplication takes 9 products and 6
85  * additions these operations should be vectorized over several vectors. Transformation matrices are usually constant on the element so we
86  * need to repeat them.
87  * - FieldFE - see FieldValues, further value in a point is linear combination of the same number of the shape functions (for the same dimension
88  * and order) Vectorisation over points of the elements of the same dimension and order is possible.
89  *
90  *
91  * <h2>Caching structures</h2>
92  *
93  * <b>EvalPoints</b>
94  * Summary all quadrature points on the reference elements. Necessary for the fields precalculation.
95  * - Internaly single table per dimension, but we need to distribute quadrature points accross dimension so there has to be one shared object.
96  * However quadratures are added per dimension in the Assembly classes.
97  * - Every reference element (dimension) have own list of local points. Only resulting 'Integral' objects know ranges of their points.
98  *
99  * <b>Integrals</b>
100  * We have four types of "elementatry integration domains":
101  * - BulkIntegral - on Cell
102  * - EdgeIntegral - on edge.side_range
103  * - CouplingIntegral - on bulk.ngh_range
104  * - BoundaryIntegral - on boundary_side_range (TBD)
105  * - In theory we can combine integrals to have at most single instance of each type per single assembly operation.
106  * - For every dimension we create them by a method of the EvalPoints, passing suitable set of quadratures.
107  * - Functions provided by Integrals:
108  * - Mark active eval points on the actual patch. Assembly algorithm iterates over "elementary integration domains" in the order of their Hilbert
109  * index. The integral has to iterate over its quadrature points. The point and distributing the points on them until we reach the number of
110  * active points per patch.
111  * - Iteration over qudrature points for given actual "elementaty integration domain" (e.g. edge integral for the CellSide)
112  * - Same method Integral::points(assembly_object) is used later on to iterate over quadrature points.
113  * - In principle there is one kind of the quadrature point accessor for every kind of the integral.
114  *
115  * Operations provided by the quadrature points:
116  * - 'field(point)' - retrieve value of a field in the point, e.g. 'pressure(bulk_point)'
117  * - 'side_point.point_on(cell_side)' - returns 'SidePoint' on the other side; allows iteration over all pairs of sides of an edge, in fact allows
118  * to form integrals mixing points of any two faces of same dimension, even non-colocated
119  * - 'coupling_side_point.lower_dim()' - returns 'BulkPoint' collocated with the side point in the dimension coupling.
120  *
121  * Possible interface to shape functions:
122  * - 'ElementFields element_fields ...' - values independent of the FEM, depends on mesh and mapping
123  * - 'FieldFE pressure_fe ...'
124  * - 'element_fields.normal(side_point)' - outer normal at a side point
125  * - 'element_fields.coord(point)' - absolute coordinates of the point
126  * - 'element_fields.JxW(point)' - Jacobian times quadrature weight of the point 'pressure_fe.base(i, point)' - value of the i-th shape function in
127  * the point
128  * - 'pressure_fe.grad(i, point)' - value of the i-th base function in the point.
129  *
130  * <b>ElementCacheMap</b>
131  * This holds mesh elements (sorted by regions) associated with the actual patch and for every pair (mesh element, eval_point) gives the active point
132  * index, these have to form continuous sequence. The points of the "assemble objects" forming the patch are added to the map. There should be no active
133  * points reuse as the active points of the two "assemble objects" of the same kind are disjoint (can change if we decide to change elemetary "assembly
134  * objects", e.g. single NGHSide instead of the range). ElementCacheMap is shared by dimensions !! It should be part of the EqData.
135  *
136  * <b>FieldValueCache</b>
137  * - One per field, holds only raw data for selected integrals.
138  * - It needs to use access to other classes for evaluation ('ElementCacheMap', 'Field' operators).
139  * - One value is a scalar, vector, or tensor. But stored through Array in a single chunk of memory. We probably use a single chunk of memory for
140  * all fields.
141  *
142  *
143  *
144  * <h2>Assemblation classes</h2>
145  *
146  * <b>AssemblyBase &lt;dim&gt; </b>
147  * Common virtual ancestor of all assembly classes. Defines their common methods and data members. One instance for every dimension.
148  *
149  * <b>XYZAssembly &lt;dim&gt; </b>
150  * - descendants of AssemblyBase implementing internals of a single assemblation process
151  * - methods:
152  * - XYZAssembly(eq_fields, eq_data) - constructor, sets shared data with equation
153  * - initialize(element_cache_map) - reinit called typicaly once per simulation, creates FEValues object, resisez vectors etc.
154  * - begin - called at beginning of the general assembly (e.g. start balance or matrix assembly)
155  * - end - called after general asssembly
156  * - cell_integral(cell, element_patch_idx) - assembles the volume integrals on cell
157  * - boundary_side_integral(cell_side) - assembles between boundary element and side of bulk element
158  * - edge_integral(edge_side_range) - assembles between sides on the edge
159  * - dimjoin_intergral(cell_lower_dim, neighb_side) - assembles between elements of different dimensions
160  *
161  * <b>GenericAssembly &lt;...&gt; </b>
162  * - templated by a class derived from AssemblyBase
163  * - contains the assembly machinary, wrap it to a single method 'assembly', this method:
164  * 1. Call 'reallocate' of ElementCacheMap and 'begin' of templated assembly class.
165  * 2. Iterates over cells (elements) and fills patch.
166  * 3. Call assembly methods (cell_integral etc.) of templated assembly class if patch is full.
167  * 4. Repeat points 2. and 3. until all cells have passed.
168  * 5. Call 'end' of templated assembly class.
169  *
170  *
171  *
172  * <h2>Cache operations</h2>
173  *
174  * <b>Overview</b>
175  * Usage of the field caches consists of:
176  * 1. Merging more quadratures into a single set of the local evaluation points (class `EvalPoints`).
177  * 2. Create a `FieldSet` one for every quadrature of the fields involved in that integral.
178  * 3. Initialize the fields in the integral's field set: Allocate the cache space in fields and mark which quadrature the field use. This is done
179  * through the call of `FieldSet::cache_reallocate(ElementCacheMap cache_map, FieldSet used_fields)`
180  * 4. Composing the assambly patch, element cache prefetching. This is organized by
181  * `ElementCacheMap`.
182  * 5. Assembly: Iterate over Integrals and get cached values from the fields.
183  *
184  * <h3>1. Initialization - evaluation points</h3>
185  *
186  * <b>EvalPoints</b>
187  * The class to store set of common local points. Two operations:
188  * 1. Add a point set, return their indices in the table. Contrary, we want to keep the points from the single quadrature in a single continuous
189  * block. In fact, we probably need to use two distinguised methods:
190  * - 'BulkIntegral &&EvalPoints::add_bulk(Quadrature bulk_quadrature)'
191  * - 'BoundaryIntegral &&add_boundary(Quadrature side_quadrature)'
192  * - 'EdgeIntegral &&add_edge(Quadrature side_quadrature)'
193  * - 'CouplingIntegral &&add_coupling(Quadrature bulk_quadrature)'
194  * 2. Get local point coordinates for given index.
195  *
196  * Common storage of quadrature points is not necessary anymore, but we keep it in order to keep indexing of the points the same accross the fields
197  * (which can use ony certain subblocks of the points). We can decrease the memory footprint (and thus the cache footprint) at the expense of slower
198  * access to the cached values (one more indirect memory access).
199  *
200  * <b>EvalSubset</b>
201  * The object containing array of indices into local point set, there is only single array for the bulk quadrature, but (n_sides x n_permutations)
202  * for a side quadrature. We shall try to keep this common for the bulk and side quadratures and non-templated. It keeps pointer to the EvalPoints,
203  * in particular to know total number of local points.
204  *
205  * <b>Example</b>
206  * @dot
207 class Assembly<dim> {
208  Assembly(EqFields *eq_fields, EqData *eq_data)
209  : AssemblyBase<dim>(eq_data_->order), eq_fields_(eq_fields), eq_data_(eq_data) {
210  // set list of used EvalSubsets, these are created automatically in GenericAssembly constructor
211  this->active_integrals_ = ActiveIntegrals::bulk | ActiveIntegrals::boundary;
212  }
213 
214  void initialize(ElementCacheMap *element_cache_map) {
215  this->element_cache_map_ = element_cache_map; // set data member of AssemblyBase
216 
217  // initialize data members
218  fe_ = std::make_shared< FE_P_disc<dim> >(eq_data_->order);
219  UpdateFlags u = update_values | update_JxW_values | update_quadrature_points;
220  fe_values_.initialize(*this->quad_, *fe_, update_values | update_JxW_values | update_quadrature_points);
221  ...
222  }
223 
224  inline void cell_integral(DHCellAccessor cell, unsigned int element_patch_idx)
225  {
226  ...
227  }
228 
229 private:
230  EqFields *eq_fields_; ///< FieldSet objects shared with equation
231  EqData *eq_data_; ///< Data objects shared with equation
232  FieldSet used_fields_; ///< Sub field set contains fields used in calculation.
233  shared_ptr<FiniteElement<dim>> fe_; ///< Finite element for the solution of the advection-diffusion equation.
234  FEValues<3> fe_values_; ///< FEValues of object (of P disc finite element type)
235  ...
236 };
237  * @enddot
238  *
239  * <h3>2. Integration field set</h3>
240  *
241  * This use existing field subset to simplify group operations on those.
242  *
243  * @dot
244  Assembly(EqFields *eq_fields, EqData *eq_data)
245  : AssemblyBase<dim>(eq_data_->order), eq_fields_(eq_fields), eq_data_(eq_data) {
246  this->active_integrals_ = ActiveIntegrals::bulk | ActiveIntegrals::boundary;
247 
248  // add list of used fields to subset
249  this->used_fields_ += eq_fields_->mass_matrix_coef;
250  this->used_fields_ += eq_fields_->retardation_coef;
251  }
252  * @enddot
253  *
254  * <h3>3. Initialization - cache allocation</h3>
255  *
256  * 'FieldSet::cache_reallocate' just adds dependent fields (eg. coordinates in FieldFormula, used fields in FieldModel) and calls the same method
257  * for every 'FieldCommon' in the set.
258  *
259  * This is a virtual method with implementation in 'Field<...>'.
260  *
261  * <i>TODO: Different assembly loops may need different evaluation subsets, so either we need mean to reallocate caches before every assembly loop
262  * or just update only selected part of the field value cache. Current structure of FieldValueCache allows to have allocated individual EvalSubset
263  * tables separately and update only selected subsets. Moreover the other subsets do not block the CPU cache (example of associativity, seems that
264  * AMD processors have 4way L1, 16way L2 and 64way L3 cache; try to get similar info for Intel) anyway we shuld mak an analysis how many memory
265  * blocks we need in any computation loop. Innermost loops should idealy access at most 4 memory blocks (4 pointers).</i>
266  *
267  * <h3>4. ElementCacheMap</h3>
268  *
269  * The map is represented by the table with dimensions [n_cache_elements][n_eval_points]. n_cache_elements is choosen as fixed (relative to number
270  * of QP in dimensions, some heuristic formula necessary), n_eval_points given before assembly. Following algorithm must avoid allocations.
271  *
272  * Filling algorithm:
273  * 1. step - Iterate over integrals collect them into a fronts (per integral type, four types). - Count total number of QP. - Stop when cache
274  * capacity is reached. - Put touched elements into an array patch_elements, pairs (region_idx, element_idx) - Drop last uncomplete integral group
275  * 2. step - sort patch_elements by region - set whole cache map table to -1 - loop over integrals (in fronts) mark needed QP by 1 - loop over whole
276  * cache map number active QPs
277  * 3. step - loop over fields in the field set - for every field make groups of regions for distiguish field algorithms - for every field algorithm
278  * call cache update on QP subset (indices of beginings of SIMD groups)
279  * 4. step - loop over integrals - call assembly routines
280  *
281  * Notes:
282  * - Every integral struct implements methods for steps: 1. and 2.
283  * - For step 1. and 2. the method modifies the element cache map object passed as an argument. Use specific modification methods in ElementCacheMap.
284  * - Step 3. independent of integrals.
285  * - For step 4. we pass the integral object to the assembly routine.
286  * - BulkIntegral: DHCellAccessor + subset index
287  * - EdgeIntegral: keeps range returned by cell_side.edge_sides() + subset index
288  * - DimJoinIntegral: DHCellAccessor, DHCellSideAccessor + bulk_subset, side_subset
289  * - BoundaryIntegral: DHCellSideAccessor + side_subset
290  * - When fields are changed make a topological sort of dependent fields on every region.
291  * - <i>Question: optimisation of FieldFormula evaluation over more regions is a bit complex. Need further analysis of relationships.</i>
292  *
293  * <i>DEPRECATED: This class synchronize the cached elements between (all) fields of single equation. It provides mapping from elements to the cache
294  * index and list of elements to cache. This have overloaded evaluation operator, which returns the index in the cache for the given element (or element
295  * index). The last cache line is overwritten if the index is not in the cache. The implementation use: table cache_idx -> el_idx, hash mapping
296  * el_idx -> cache_idx, list of cache lines that schould be updated.</i>
297  *
298  * @dot
299  inline void cell_integral(DHCellAccessor cell, unsigned int element_patch_idx)
300  {
301  ElementAccessor<3> elm = cell.elm();
302  fe_values_.reinit(elm);
303 
304  for (auto p : this->bulk_points(element_patch_idx) )
305  {
306  ...
307  }
308 
309  }
310  * @enddot
311  *
312  * <b>Cache update:</b>
313  *
314  * @dot
315  // Call cache_update for the fields in the field sets
316  // This can also be done in the generic loop.
317  this->eq_fields->cache_update(this->element_cache_map);
318 
319  // Possible update of the base function values.
320  presssure_field_fe.fe_values.update(cell);
321  * @enddot
322  *
323  * Two major algorithms are in use:
324  * - FieldFE - evaluates base func values in all quadrature points (done once per assembly), dot product with DOFs, optionaly multiplied by the Mapping
325  * matrix (important optimization for vector fields and derivatives, must have support in FEValues)
326  * - FieldFE has three instances of FEValues, one for every dimension. This is possible as the EvalPoints are structured by the dimension.
327  * - During cache update we:
328  * a. for every element in the elementcache map (range of elements in the region)
329  * b. find element dimension, call fe_values[dim].reinit(el) for that element
330  * c. update cache values in that row of the element cache map
331  * - The reinit may be slightly inefficient as all values are computed not only the active. We can try to optimize that after it is in use and we can
332  * see if it is a real problem.
333  * - FieldFormula - evaluates all elements in the patch (same region), in all point from single continuous block od quad points
334  *
335  * <h3>5. Assembly, cache read</h3>
336  *
337  * @dot
338  // Asumme following types:
339  EvalSubset this->mass_eval;
340  EvalSubset this->side_eval;
341  EvalSubset this->ngh_side_eval;
342 
343  ...
344  DHCellAccessor cache_cell = this->element_cache_map(cell);
345  // Bulk integral, no sides, no permutations.
346  for(BulkPoint q_point: this->mass_eval.points(cache_cell)) {
347  // Extracting the cached values.
348  double cs = cross_section(q_point);
349 
350  // Following would be nice to have. Not clear how to
351  // deal with more then single element as fe_values have its own cache that has to be updated.
352  auto base_fn_grad = presssure_field_fe.base_value(q_point);
353  loc_matrix += outer_product((cs * base_fn_grad), base_fn_grad)
354  }
355 
356  // Side integrals.
357  // FieldFE<..> conc;
358  for (DHCellSide side : cache_cell.side_range()) {
359  for(DHCellSide el_ngh_side : side.edge_sides()) {
360  // vector of local side quadrature points in the correct side permutation
361  Range<SidePoint> side_points = this->side_eval.points(side)
362  for (SidePoint p : side_points) {
363  ngh_p = p.permute(el_ngh_side);
364  loc_mat += cross_section(p) * sigma(p) *
365  (conc.base_value(p) * velocity(p)
366  + conc.base_value(ngh_p) * velocity(ngh_p)) * p.normal() / 2;
367  }
368  }
369  }
370 
371  // Dimension coupling
372  // TODO: update
373  for (DHNeighbSide ngh_side cache_cell.neighb_sides()) {
374  // vector of local side quadrature points in the correct side permutation
375  Range<BulkPoint> side_points = this->ngh_side_eval.points(ngh_side);
376  for (auto p : side_points) {
377  side_avg += cross_section(el_ngh_p) * sigma(el_ngh_p) *
378  ( velocity(side_p) + velocity(el_ngh_p)) * side_p.normal();
379  }
380  }
381  * @enddot
382  *
383  * In order to use FieldValueCache in consistent way, we can precompute: quadrature points coordinates, normals and JxW into suitable
384  * FieldValueCache and possibly wrap it into Fields in order to use them in FieldFormula. However direct access should be provided through
385  * BulkPoint and SidePoint. precise We want cache_allocate and cache_update to update the mapping caches however we have to pass the mapping
386  * somehow into XYPoint and to do this on single place we want to pass it to EvalPoints.
387  *
388  * <b>Interface for the bulk integrals</b>
389  *
390  * Bulk integrals are evaluated only on the single element, so we can assume, that all values are cached. No need for evaluationg just some points
391  * etc. 'Field<..>::operator() (BulkPoint)' use dimension and cache element index that are part of the BulkPoint structure to access the cached
392  * value. Similarly we can exted caching in FeValues to more elements and provide acces to them:
393  * - 'FieldFE<..>::base_value(BulkPoint, component)'
394  * - 'FieldFE<..>::base_grad(BulkPoint, component)'
395  *
396  * <b>Interface for the edge integrals</b>
397  *
398  * We need to access values on two elements of two matching sides. All sides of all elements are necessary so we want to cache all side points in
399  * the same way as the values in bulk points. We can access matching side but we also need its proper permutation, to this end we provide SidePoint
400  * and its mapping to the connected side of the other element.
401  * - `Field<..>::operator() (SidePoint)`
402  * - `FieldFE<..>::base_value(SidePoint, component)`
403  * - `FieldFE<..>::base_grad(SidePoint, component)`
404  *
405  * <b>Interface for dimension coupling integrals</b>
406  *
407  * In general the quadrature can be different then the quadrature used on faces, or the could be no integration over the faces as in the case of P1
408  * method. No problem for the lower dim element as we have to evaluate fields at all bulk points on al these elements. Only matching evaluation points
409  * on the connected side of a bulk element are necessary. Moreover we are not able to change mask of evaluation point accoring to the elements.
410  *
411  * Two possible solutions:
412  * - local point sets varies with elements
413  * - evaluation of noncached values, but still want to have the FieldValues in FieldFE for their points
414  *
415  * <i>TODO:
416  * - No sense to have more then single bulk element in bulk integrals.
417  * - In Side integrals need: normals, more then single element (elements of same or different dimension).
418  * - How to match q poiunts on side of higher dim element and bulk of lower dim elemnet. Not clear how it works in current implementation. Is the
419  * side permutation compatible with connected fracture elements?
420  * - How to cache values in quadrature points on faces of the higher dim elements efficiently? We want to precompute only values on a single side.
421  * It could be fine in case of DG if we use same quadratures for side terms as well as for the coupling temrs. But for general quadratures there
422  * could be problem. It seems the either we do not cache in this case or need more general cache.
423  * - How to apply only FeValues to FE living on edges without FeSideValues.</i>
424  *
425  * <b>Further thoughts:</b>
426  * - Extension to interdependent fields: If a field depends on the other field, it recursively informs the other field about quadrature.
427  * - We introduce fields for absolute cooridinates X, Y, Z as well as for the depth, this is related to the generalization of the FieldFormula, that
428  * can use also other fields in the formulas.
429  *
430  *
431  * @ingroup sssembly
432  *
433  */