Flow123d  DF_profiler_memory_monitor-83fd579
Assembly Process

Overview of assembly process

Evaluation of Fields is performed during assemblies at different evaluation points. Assembly process allows summary evaluation of elements on one patch. The calculation is performed subsequently on all these elements.

Solved problems

FieldFE use FEValues in implementation of the value method, this involves creation of the FEValues object. This object should be created only once ideally.

FieldFormula can be evaluated efficiently for the vectors of points. So we need to evaluate lot of points together. That is possible once the points are from elements of the same region.

We need to achieve effective use of memory:

Design ideas and constraints

The weak formulation is based on the evaluation of the integrals of four types:

All necessary quadrature points must be distributed to the reference elements during setup of the assmebly. FieldFE is then able to create and precompute the FEValue object for all these local points. The values on the reference element are computed once. Mapping (of vectors and tensors) to the actual element can be done only for the active points on current patch of elements.

For the efficient evaluation of the FieldFormula fields we need to evaluate about 128 points at once. Efficient parsing and formula evaluation library is necessary. We use a specially developed parser for this purpose: BParser.

In order to allow FieldFormula to depend on other fields we need that other fields are also evaluated into caches. This is a bit unfortunate for the FieldConstant and possibly cache inefficient in the case of FieldFE. The FieldFE caching is inefficient only in the case of using same FE space for some field as well as for the equation itself, that is in the case of nonlinear equations. Even in such case, we at most compute local base functions twice, once when prefetching the field value cache and second when assembling on individual elements. If the recalculation takes longer then L3 access we can cache all FEValue data for all patch quad points.

Example of CPU specification: Intel Core i7: L1d 32kB, L2 256kB, L3 8MB. I7 cache specification: block size 64 bytes, L1d 8way, L2 8way, L3 16way shared for cores (What if every core works on separate memory chunk, 16 way can be limitting factor?) Considering about 30 scalar fields, 128 points per field and 8 bytes per double the field cache is about 30kB, that fits (with other data as Dofs, FEValues) well within L2 cache, even if we consider using just half of it to allow prefetching. Conclusion: size of the field cache that enables amortization of formula evaluation is about the size suitable for the CPU cache optimization of the assembly process.

We need to join evaluation of FieldFormula for individual integrals so there will be more complex logic in the selection of field algorithm since only subset of the fields is involved in every integral.

Gauss quadratures imply no shared quadrature points between bulk and face quadratures. There can be some reuse for boundary and coupling integrals, but these are relatively rare.

Face integrals should deal with various side permutations.

We should use existing field variables to referencing their cached values.

Although we will have a common set of local evaluation points on the reference element we may allow to use only subset of them on the actual patch of elements.

The mapping of the pair (element, evaluation point) to the active point on the current patch must be same for all fields invloved in the assembly. We need a class for this synchronisation.

Ideas to exploit SIMD:

Caching structures

EvalPoints Summary all quadrature points on the reference elements. Necessary for the fields precalculation.

Integrals We have four types of "elementatry integration domains":

Operations provided by the quadrature points:

Possible interface to shape functions:

ElementCacheMap This holds mesh elements (sorted by regions) associated with the actual patch and for every pair (mesh element, eval_point) gives the active point index, these have to form continuous sequence. The points of the "assemble objects" forming the patch are added to the map. There should be no active points reuse as the active points of the two "assemble objects" of the same kind are disjoint (can change if we decide to change elemetary "assembly objects", e.g. single NGHSide instead of the range). ElementCacheMap is shared by dimensions !! It should be part of the EqData.

FieldValueCache

Assemblation classes


AssemblyBase <dim> Common virtual ancestor of all assembly classes. Defines their common methods and data members. One instance for every dimension.

XYZAssembly <dim>

GenericAssembly <...>

Cache operations


Overview Usage of the field caches consists of:

  1. Merging more quadratures into a single set of the local evaluation points (class EvalPoints).
  2. Create a FieldSet one for every quadrature of the fields involved in that integral.
  3. Initialize the fields in the integral's field set: Allocate the cache space in fields and mark which quadrature the field use. This is done through the call of FieldSet::cache_reallocate(ElementCacheMap cache_map, FieldSet used_fields)
  4. Composing the assambly patch, element cache prefetching. This is organized by ElementCacheMap.
  5. Assembly: Iterate over Integrals and get cached values from the fields.

1. Initialization - evaluation points

EvalPoints The class to store set of common local points. Two operations:

  1. Add a point set, return their indices in the table. Contrary, we want to keep the points from the single quadrature in a single continuous block. In fact, we probably need to use two distinguised methods:
  2. Get local point coordinates for given index.

Common storage of quadrature points is not necessary anymore, but we keep it in order to keep indexing of the points the same accross the fields (which can use ony certain subblocks of the points). We can decrease the memory footprint (and thus the cache footprint) at the expense of slower access to the cached values (one more indirect memory access).

EvalSubset The object containing array of indices into local point set, there is only single array for the bulk quadrature, but (n_sides x n_permutations) for a side quadrature. We shall try to keep this common for the bulk and side quadratures and non-templated. It keeps pointer to the EvalPoints, in particular to know total number of local points.

Example

2. Integration field set


This use existing field subset to simplify group operations on those.

3. Initialization - cache allocation

'FieldSet::cache_reallocate' just adds dependent fields (eg. coordinates in FieldFormula, used fields in FieldModel) and calls the same method for every 'FieldCommon' in the set.

This is a virtual method with implementation in 'Field<...>'.

TODO: Different assembly loops may need different evaluation subsets, so either we need mean to reallocate caches before every assembly loop or just update only selected part of the field value cache. Current structure of FieldValueCache allows to have allocated individual EvalSubset tables separately and update only selected subsets. Moreover the other subsets do not block the CPU cache (example of associativity, seems that AMD processors have 4way L1, 16way L2 and 64way L3 cache; try to get similar info for Intel) anyway we shuld mak an analysis how many memory blocks we need in any computation loop. Innermost loops should idealy access at most 4 memory blocks (4 pointers).

4. ElementCacheMap


The map is represented by the table with dimensions [n_cache_elements][n_eval_points]. n_cache_elements is choosen as fixed (relative to number of QP in dimensions, some heuristic formula necessary), n_eval_points given before assembly. Following algorithm must avoid allocations.

Filling algorithm:

  1. step - Iterate over integrals collect them into a fronts (per integral type, four types). - Count total number of QP. - Stop when cache capacity is reached. - Put touched elements into an array patch_elements, pairs (region_idx, element_idx) - Drop last uncomplete integral group
  2. step - sort patch_elements by region - set whole cache map table to -1 - loop over integrals (in fronts) mark needed QP by 1 - loop over whole cache map number active QPs
  3. step - loop over fields in the field set - for every field make groups of regions for distiguish field algorithms - for every field algorithm call cache update on QP subset (indices of beginings of SIMD groups)
  4. step - loop over integrals - call assembly routines

Notes:

DEPRECATED: This class synchronize the cached elements between (all) fields of single equation. It provides mapping from elements to the cache index and list of elements to cache. This have overloaded evaluation operator, which returns the index in the cache for the given element (or element index). The last cache line is overwritten if the index is not in the cache. The implementation use: table cache_idx -> el_idx, hash mapping el_idx -> cache_idx, list of cache lines that schould be updated.

Cache update:

Two major algorithms are in use:

5. Assembly, cache read


In order to use FieldValueCache in consistent way, we can precompute: quadrature points coordinates, normals and JxW into suitable FieldValueCache and possibly wrap it into Fields in order to use them in FieldFormula. However direct access should be provided through BulkPoint and SidePoint. precise We want cache_allocate and cache_update to update the mapping caches however we have to pass the mapping somehow into XYPoint and to do this on single place we want to pass it to EvalPoints.

Interface for the bulk integrals

Bulk integrals are evaluated only on the single element, so we can assume, that all values are cached. No need for evaluationg just some points etc. 'Field<..>::operator() (BulkPoint)' use dimension and cache element index that are part of the BulkPoint structure to access the cached value. Similarly we can exted caching in FeValues to more elements and provide acces to them:

Interface for the edge integrals

We need to access values on two elements of two matching sides. All sides of all elements are necessary so we want to cache all side points in the same way as the values in bulk points. We can access matching side but we also need its proper permutation, to this end we provide SidePoint and its mapping to the connected side of the other element.

Interface for dimension coupling integrals

In general the quadrature can be different then the quadrature used on faces, or the could be no integration over the faces as in the case of P1 method. No problem for the lower dim element as we have to evaluate fields at all bulk points on al these elements. Only matching evaluation points on the connected side of a bulk element are necessary. Moreover we are not able to change mask of evaluation point accoring to the elements.

Two possible solutions:

TODO:

Further thoughts: