-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Good SoA abstraction #272
Comments
#100 has an attempt with lots of duplication on the SoA definition side (so rather poor from programming model point of view). |
In #100 (comment) @VinInn commented
|
#271 has an approach similar to #272 (comment), i.e. a C struct. Some downsides regarding general use
|
In production there is no need of runtime sizing: if one size does not fill all, we just compile different size and template everything. (for HLT I would like to see first a REAL USE CASE for runtime sizing) I was thinking to enforce alignment at cacheline level (if not a page level) |
I would point out that "dynamic allocation" and "resizing" are different: we can certainly have data structures (including SoA's) with a size determined at construction time, that do not support resizing. I agree that, in general, we do not want to resize structures once the event data has been filled. But I would say that we want to be able to handle collections of whose size we do not know a priory. |
This was exactly the case I meant. I rephrased my earlier comment to avoid further confusion. |
For the beginning of the memory block yes, but with struct Foo {
float pt[100];
float eta[100];
}; the element |
so we need to static assert that N is a multiple of 128 (which is sane anyway) |
Yup. But the alignment is still a user's responsibility. |
What about |
a possible solution to keep the SOA fixed size and allow dynamic size setting is to bucketize (ASOA) provided the stride is a power of 2 it is quite trivial (and fast) to iterate and do random access.... |
and here the cuda version |
See also #344 . |
Here is my attempt to write a SoA that:
Currently it does not support, but I plan to add them later:
|
step 1: what it might look likeThe fields are something like struct {
double x, y, z;
uint16_t colour;
int32_t value;
}; definition// compile-time sized SoA
template <size_t SIZE, size_t ALIGN=0>
struct SoA {
// these could be moved to an external type trait to free up the symbol names
static const size_t size = SIZE;
static const size_t alignment = ALIGN;
// AoS-like (row-wise) accessor to individual elements
struct element {
element(SoA & soa, size_t index) :
x(soa.x[index]),
y(soa.y[index]),
z(soa.z[index]),
colour(soa.colour[index]),
value(soa.value[index]),
{ }
double & x;
double & y;
double & z;
uint16_t & colour;
int32_t & value;
};
// AoS-like (row-wise) access to individual elements
element operator[](size_t index) { return element(*this, index); }
// data members
alignas(ALIGN) double x[SIZE];
alignas(ALIGN) double y[SIZE];
alignas(ALIGN) double z[SIZE];
alignas(ALIGN) uint16_t colour[SIZE];
alignas(ALIGN) int32_t value[SIZE];
}; sample usageint main(void) {
// 10 elements, aligned on 32-bytes boundary
SoA<10, 32> soa;
// SoA-like access
soa.z[7] = 42;
// AoS-like access
soa[7].z = 42;
// check that SoA-like and SoA-like access the same elements
assert(& soa.z[7] == & (soa[7].z));
} |
step 2: use private members and public accessorsNote that we could also use definition// compile-time sized SoA
template <size_t SIZE, size_t ALIGN=0>
struct SoA {
// these could be moved to an external type trait to free up the symbol names
static const size_t size = SIZE;
static const size_t alignment = ALIGN;
// AoS-like (row-wise) accessor to individual elements
struct element {
element(SoA & soa, size_t index) :
soa_(soa),
index_(index)
{ }
double & x() { return soa_.x_[index_]; }
double & y() { return soa_.y_[index_]; }
double & z() { return soa_.z_[index_]; }
uint16_t & colour() { return soa_.colour_[index_]; }
int32_t & value() { return soa_.value_[index_]; }
SoA & soa_;
const size_t index_;
};
// AoS-like (row-wise) access to individual elements
element operator[](size_t index) { return element(*this, index); }
// accessors
double* x() { return x_; }
double* y() { return y_; }
double* z() { return z_; }
uint16_t* colour() { return colour_; }
int32_t* value() { return value_; }
private:
// data members
alignas(ALIGN) double x_[SIZE];
alignas(ALIGN) double y_[SIZE];
alignas(ALIGN) double z_[SIZE];
alignas(ALIGN) uint16_t colour_[SIZE];
alignas(ALIGN) int32_t value_[SIZE];
alignas(ALIGN) const char* name_[SIZE];
}; sample usageint main(void) {
// 10 elements, aligned on 32-bytes boundary
SoA<10, 32> soa;
// SoA-like access
soa.z()[7] = 42;
// AoS-like access
soa[7].z() = 42;
// check that SoA-like and SoA-like access the same elements
assert(& soa.z()[7] == & (soa[7].z()));
} |
step 3: make data member configurableThe syntax for the declaration has same flexibility, if one wants to play a bit with it. Note: scroll right for the definition#include <boost/preprocessor/cat.hpp>
#include <boost/preprocessor/stringize.hpp>
#include <boost/preprocessor/seq/for_each.hpp>
#include <boost/preprocessor/variadic/to_seq.hpp>
/* declare SoA data members; these should exapnd to, for example
*
* alignas(ALIGN) double x_[SIZE];
*
*/
#define _DECLARE_SOA_DATA_MEMBER_IMPL(TYPE, NAME) \
alignas(ALIGN) TYPE BOOST_PP_CAT(NAME, _[SIZE]);
#define _DECLARE_SOA_DATA_MEMBER(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_DATA_MEMBER_IMPL TYPE_NAME)
#define _DECLARE_SOA_DATA_MEMBERS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_DATA_MEMBER, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* declare SoA accessors; these should expand to, for example
*
* double* x() { return x_; }
*
*/
#define _DECLARE_SOA_ACCESSOR_IMPL(TYPE, NAME) \
TYPE* NAME() { return BOOST_PP_CAT(NAME, _); }
#define _DECLARE_SOA_ACCESSOR(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_ACCESSOR_IMPL TYPE_NAME)
#define _DECLARE_SOA_ACCESSORS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_ACCESSOR, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* declare AoS-like element accessors; these should expand to, for example
*
* double & x() { return * (soa_.x() + index_); }
*
*/
#define _DECLARE_SOA_ELEMENT_ACCESSOR_IMPL(TYPE, NAME) \
TYPE & NAME() { return * (soa_. NAME () + index_); }
#define _DECLARE_SOA_ELEMENT_ACCESSOR(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_ELEMENT_ACCESSOR_IMPL TYPE_NAME)
#define _DECLARE_SOA_ELEMENT_ACCESSORS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_ELEMENT_ACCESSOR, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* compile-time sized SoA
*/
#define DECLARE_SOA_TEMPLATE(CLASS, ...) \
template <size_t SIZE, size_t ALIGN=0> \
struct CLASS { \
\
/* these could be moved to an external type trait to free up the symbol names */ \
using self_type = CLASS; \
static const size_t size = SIZE; \
static const size_t alignment = ALIGN; \
\
/* introspection */ \
static void dump() { \
std::cout << #CLASS "<" << SIZE << ", " << ALIGN << "): " << std::endl; \
std::cout << " sizeof(...): " << sizeof(CLASS) << std::endl; \
std::cout << " alignof(...): " << alignof(CLASS) << std::endl; \
_DECLARE_SOA_DUMP_INFOS(__VA_ARGS__) \
std::cout << std::endl; \
} \
\
/* AoS-like accessor to individual elements */ \
struct element { \
element(CLASS & soa, size_t index) : \
soa_(soa), \
index_(index) \
{ } \
\
_DECLARE_SOA_ELEMENT_ACCESSORS(__VA_ARGS__) \
\
private: \
CLASS & soa_; \
const size_t index_; \
}; \
\
/* AoS-like accessor */ \
element operator[](size_t index) { return element(*this, index); } \
\
/* accessors */ \
_DECLARE_SOA_ACCESSORS(__VA_ARGS__) \
\
private: \
/* data members */ \
_DECLARE_SOA_DATA_MEMBERS(__VA_ARGS__) \
\
} sample usageDECLARE_SOA_TEMPLATE(SoA,
(double, x),
(double, y),
(double, z),
(uint16_t, colour),
(int32_t, value),
);
int main(void) {
// 10 elements, aligned on 32-bytes boundary
SoA<10, 32> soa;
// SoA-like access
soa.z()[7] = 42;
// AoS-like access
soa[7].z() = 42;
// check that SoA-like and SoA-like access the same elements
assert(& soa.z()[7] == & (soa[7].z()));
} |
step 4: add scalar typesI realised that we may want to have scalar types, that are unique for the whole SoA, instead of per-element. definition#include <boost/preprocessor.hpp>
/* declare "scalars" (one value shared across the whole SoA) and "columns" (one vale per element) */
#define SoA_scalar(TYPE, NAME) (0, TYPE, NAME)
#define SoA_column(TYPE, NAME) (1, TYPE, NAME)
/* declare SoA data members; these should exapnd to, for columns:
*
* alignas(ALIGN) double x_[SIZE];
*
* and for scalars:
*
* double x_;
*
*/
#define _DECLARE_SOA_DATA_MEMBER_IMPL(IS_COLUMN, TYPE, NAME) \
BOOST_PP_IIF(IS_COLUMN, \
alignas(ALIGN) TYPE BOOST_PP_CAT(NAME, _[SIZE]); \
, \
TYPE BOOST_PP_CAT(NAME, _); \
)
#define _DECLARE_SOA_DATA_MEMBER(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_DATA_MEMBER_IMPL TYPE_NAME)
#define _DECLARE_SOA_DATA_MEMBERS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_DATA_MEMBER, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* declare SoA accessors; these should expand to, for columns:
*
* double* x() { return x_; }
*
* and for scalars:
*
* double& x() { return x_; }
*
*/
#define _DECLARE_SOA_ACCESSOR_IMPL(IS_COLUMN, TYPE, NAME) \
BOOST_PP_IIF(IS_COLUMN, \
TYPE* NAME() { return BOOST_PP_CAT(NAME, _); } \
, \
TYPE& NAME() { return BOOST_PP_CAT(NAME, _); } \
)
#define _DECLARE_SOA_ACCESSOR(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_ACCESSOR_IMPL TYPE_NAME)
#define _DECLARE_SOA_ACCESSORS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_ACCESSOR, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* declare AoS-like element accessors; these should expand to, for columns:
*
* double & x() { return * (soa_.x() + index_); }
*
* and for scalars:
*
* double & x() { return soa_.x(); }
*
*/
#define _DECLARE_SOA_ELEMENT_ACCESSOR_IMPL(IS_COLUMN, TYPE, NAME) \
BOOST_PP_IIF(IS_COLUMN, \
TYPE & NAME() { return * (soa_. NAME () + index_); } \
, \
TYPE & NAME() { return soa_. NAME (); } \
)
#define _DECLARE_SOA_ELEMENT_ACCESSOR(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_ELEMENT_ACCESSOR_IMPL TYPE_NAME)
#define _DECLARE_SOA_ELEMENT_ACCESSORS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_ELEMENT_ACCESSOR, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* compile-time sized SoA
*/
#define declare_SoA_template(CLASS, ...) \
template <size_t SIZE, size_t ALIGN=0> \
struct CLASS { \
\
/* these could be moved to an external type trait to free up the symbol names */ \
using self_type = CLASS; \
static const size_t size = SIZE; \
static const size_t alignment = ALIGN; \
\
/* AoS-like accessor to individual elements */ \
struct element { \
element(CLASS & soa, size_t index) : \
soa_(soa), \
index_(index) \
{ } \
\
_DECLARE_SOA_ELEMENT_ACCESSORS(__VA_ARGS__) \
\
private: \
CLASS & soa_; \
const size_t index_; \
}; \
\
/* AoS-like accessor */ \
element operator[](size_t index) { return element(*this, index); } \
\
/* accessors */ \
_DECLARE_SOA_ACCESSORS(__VA_ARGS__) \
\
private: \
/* data members */ \
_DECLARE_SOA_DATA_MEMBERS(__VA_ARGS__) \
\
} sample usage// declare a statically-sized SoA, templated on the column size and (optional) alignment
declare_SoA_template(SoA,
// predefined static scalars
// size_t size;
// size_t alignment;
// columns: one value per element
SoA_column(double, x),
SoA_column(double, y),
SoA_column(double, z),
SoA_column(uint16_t, colour),
SoA_column(int32_t, value),
// scalars: one value for the whole structure
SoA_scalar(const char *, description)
);
int main(void) {
// 10 elements, aligned on 32-bytes boundary
SoA<10, 32> soa;
// SoA-like access
soa.z()[7] = 42;
// AoS-like access
soa[7].z() = 42;
// check that SoA-like and SoA-like access the same elements
assert(& soa.z()[7] == & (soa[7].z()));
} |
@vkhristenko if you are interested |
I had a chance to discuss with Alex Huebl (former alpaka developer) in the ATPESC training (he was there as a student as well). He pointed me to the LLAMA (Low Level Abstraction of Memory Access) library that the same team has been developing. On a cursory look it seems interesting (might not solve everything but at least has interesting constructs)
Nested structs like struct Pixel {
struct {
float r
float g
float b;
} color;
char alpha;
}; are expressed like struct color {};
struct alpha {};
struct r {};
struct g {};
struct b {};
using Pixel = llama::DatumStruct <
llama::DatumElement < color, llama::DatumStruct <
llama::DatumElement < r, float >,
llama::DatumElement < g, float >,
llama::DatumElement < b, float >
> >,
llama::DatumElement < alpha, char >
>; (there are shorthands Assuming a 3D array of array[1][2][3].color.g = 1.0; is expressed in LLAMA as view( 1, 2, 3 )( color(), g() ) = 1.0;
// or
view( 1, 2, 3 ).access<color, g>() = 1.0; This interface is always the same regardless of the underlying memory structure (AoS, SoA, something more complicated). For more information I suggest to take a look of the documentation. It's quite short and has nice examples of how it looks like. |
@makortel thanks for the pointers. I've had a look and so far I am not overly fond of the syntax and complexity... |
A SoA with variable sized arrays prototype is available here: https://github.com/ericcano/soa/blob/master/soa_v6.h. It introduces an AoS-like syntax A test port to HGCAL is available in this branch: https://github.com/ericcano/cmssw/tree/hgcal-clue (forked from: https://github.com/b-fontana/cmssw/tree/clue_porting_to_cmssw). |
On 29 Jun, 2021, at 4:40 PM, Eric Cano ***@***.***> wrote:
A SoA with variable sized arrays prototype is available here: https://github.com/ericcano/soa/blob/master/soa_v6.h. It introduces an AoS-like syntax soa[i].x.
A test port to HGCAL is available in this branch: https://github.com/ericcano/cmssw/tree/hgcal-clue (forked from: https://github.com/b-fontana/cmssw/tree/clue_porting_to_cmssw).
Could you please share a more direct pointer to the code where is implemented?
v.
… |
The main macro is used to define a SoA class for HGCal. The AoS-like syntax is achieved with |
do you have ptx (even for some simple examples)? |
I finished a test (which generated a SoA GPU side and checks the contents CPU side). The generated PTX is in this gist. The assignment step takes 2 steps to compute the address before accessing the global memory:
|
would be interesting to compare w/r/t a naive implementation a cross product:
|
I made various flavors of the cross product in this repository. The PTX is not very appealing as we get tons of integer computations for few floating point. I suspect and hope this is better optimized during the conversion from PTX to assembly (did not check yet). On the C++ syntax front, we can note that the same templated function could be used with a mixture of a SoA and AoS. Finally, the C++ language does not allow creation of references to temporary variables (and all compilers, gcc and nvcc alike, abide by this rule). Workarounds like this are probably unavoidable to call functions that get references (const references are not affected): auto ri = r[i];
crossProduct(ri, a[i], b[i]); |
this is what Eigen is able to do |
and here a little example that compiles and run which also demonstrates that there is no problem in assigning to a temporary at least with Eigen. |
The issue was simply a compilation mistake: using the The main difference between the two seems to be the ratio of integer computations vs pointer conversions ( Handmade version also loads more parameters. On the floating point side, we have 6 multiplications and 3 subtraction in both cases, as one would expect. That would hint to a SoA system using more pointer arithmetic and less pointer storage to gain a bit of performance. I will try another flavor, where we store offsets instead of pointers. Another direction of investigation is to integrate support for Eigen structures as columns: the current example uses the fact the we declared x, y and z next to each other to made a vector out of the 3. This should be automated to avoid code repetition and caveats (matrices will be impractical to declare in the current fashion). Finally, this example uses a separate data block, and descriptor class pointing to it. We could merge the two together, further limiting the passing of big parameters to the kernel (from a structure to a single pointer). (I will leave this aside for the moment). |
let me note that in my example https://godbolt.org/z/K61q8zMW5 cuda loads the input in not-coherent memory (faster and separated from the standard cache) |
you may also note that in your ptx there are 12 loads while in eigen just 6 (as should be) |
another unrelated point we may wish to report to NVidia is that clang for instance generated mul neg fma instead of mul mul sub https://godbolt.org/z/xjbMsoed7 but in not able to load in nc memory... |
Thanks, @VinInn , the double loading from global (with a cache hit on the second, but still...) was a significant difference. Adding the restricted attributes enabled both the use of non coherent cache and the removal of the duplicate load. This leads to a marginal gain in both cases (~50ns), but the difference remains (now 6.90us vs 6.25us). |
Indeed. There are also integer multiplications by constant powers of 2 instead of shifts in the clang compiled code. Also, nvcc uses |
The offset based version of the SoA (as opposed to pointer based) did allow the reduction of address conversions, but at the detriment of many things (more parameters, loss of non-coherent cache use, re-introduction of multiple fetches of the same variables from global memory). It is version 8 in the same repository. The new PTX for the indirectCrossProduct kernel is here. Here are 2 measurements of the kernel timings (event based for a loop of 20 kernels) and via nvprof for both versions:
These numbers are slightly different from the previous ones which I got via an nvprof dump and looking at them in nvvp (but this new way is faster). With more statistics, the version 7 is actually quite close to the eigen one. As the eigen one is slightly advantaged as we do some pointer computations host side, this is not that far off. One positive outcome of this exercise is that the SoA layout could be changed without modifying the client code. The next and more promising step is to add support for Eigen fixed size types to the pointer based version (v7). |
do not understand why more parameters:
if I got the parenthesis right |
This is all with runtime defined size. The SoA structure is fully defined by a pointer, the number of elements (and an optional byte alignment parameter), so we could also pass just the first 2 and re-compute the SoA kernel side too. SoA(std::byte *mem, size_t nElements, size_t byteAlignment = defaultAlignment) : mem_(mem), nElements_(nElements), byteAlignment_(byteAlignment) {
auto curMem = mem_;
x_ = reinterpret_cast<double *>(curMem);
curMem += (((nElements_ * sizeof(double) - 1) / byteAlignment_) + 1) * byteAlignment_;
y_ = reinterpret_cast<double *>(curMem);
curMem += (((nElements_ * sizeof(double) - 1) / byteAlignment_) + 1) * byteAlignment_;
z_ = reinterpret_cast<double *>(curMem);
curMem += (((nElements_ * sizeof(double) - 1) / byteAlignment_) + 1) * byteAlignment_;
colour_ = reinterpret_cast<uint16_t *>(curMem);
curMem += (((nElements_ * sizeof(uint16_t) - 1) / byteAlignment_) + 1) * byteAlignment_;
value_ = reinterpret_cast<int32_t *>(curMem);
curMem += (((nElements_ * sizeof(int32_t) - 1) / byteAlignment_) + 1) * byteAlignment_;
py_ = reinterpret_cast<double **>(curMem);
curMem += (((nElements_ * sizeof(double *) - 1) / byteAlignment_) + 1) * byteAlignment_;
description_ = reinterpret_cast<const char **>(curMem);
curMem += (((sizeof(const char *) - 1) / byteAlignment_) + 1) * byteAlignment_;
if (mem_ + computeDataSize(nElements_, byteAlignment_) != curMem)
throw std::out_of_range("In "
"SoA"
"::"
"SoA"
": unexpected end pointer.");
} In all examples, this computation is done host side, but we could also push it device side, and re-construct the SoA descriptor there. The |
@VinInn , yes. I used byte alignment, while you have index alignment (which ensures byte alignment and simplicity, at the cost of a bit more padding). I will try that too, it's indeed going to solve the parameter problem. This will require integrating the size of column members (yColumnPosition = xColumnPosition + sizeof (yElement) ), which might be tricky in the current macro based generation (to be investigated). The macros are written once for all and work fine, but are not easy to debug. Which bring be to this question: what about using an external code generator (like a python script, or some pre-existing tool). This would not change the principle of what we have currently: a description of the structure turned into code by an engine. Right now the description is:
The engine is soa_vX.h, and generated code is in the preprocessed output. Does this sound like a good or bad idea to use an external tool? That would require some support from the build system, obviously. |
I personally dislike both macros and code generators (not that complex template manipulation is better, still)
|
Some random comments, as I'm not fully following the discussion. I would prefer to avoid an external code generator if C++ is enough. About macrosvs templates I have mixed feelings: templates are more interested in the language and (more) type safe than macros, but not flexible enough for a clean declaration syntax. About individual pointers vs. base pointer + deltas: I assumed individual pointers would be better based on some comments Vincenzo made some time ago - but I don't have any evidence either way. For deltas I would use byte addressing, because I expect SoAs to have mixed types. |
My summary:
something like this may work and seem to generate optimal access code (comparable to Eigen)
pass how this is generated up to you. |
The version with embedded Eigen structures now works, passes tests. I made a few variants, including trying to push the strict minimum (data pointer + size) and re-constructing the SoA descriptor device side (it's the local object variant). The resulting PTX is here, while the test timings are:
Regarding the PTX, it's pretty well understood, except for the use of non-coherent cache, which I did not manage to get for the embedded (Eigen vector) versions. The embedding code is generic, so we could throw any fixed sized Eigen element into the SoA. The macro based declaration is like this: declare_SoA_template(SoA,
// columns: one value per element
SoA_column(double, x),
SoA_column(double, y),
SoA_column(double, z),
SoA_eigenColumn(Eigen::Vector3d, a),
SoA_eigenColumn(Eigen::Vector3d, b),
SoA_eigenColumn(Eigen::Vector3d, r),
SoA_column(uint16_t, colour),
SoA_column(int32_t, value),
SoA_column(double *, py),
// scalars: one value for the whole structure
SoA_scalar(const char *, description)
); and the SoA is constructed and used like this (example from the locally constructed descriptor in the kernel, data block is already populated): bool deviceConstructor = true;
testSoA::SoA soa(deviceConstructor, data, nElements);
soa[i].r() = soa[i].a().cross(soa[i].b()); I did not mange to get the cast trick to work here, so we might end up with slightly heavier syntax with parenthesis to access members (for uniformity). |
@VinInn , I'm finally getting back to the SoA business. Since I'm still catching up, could you explain your example ?
In particular, where is the assignment to a temporary ? After sitting with Eric and looking at his approach, I now understand that the problem is not in "assigning to a temporary", but in "passing a non-const reference to a temporary". |
We clearly need a good SoA abstraction to work with.
Some feature considerations for explicit memory
cudaMemcpyAsync()
With unified memory the situation would of course be simpler with
The text was updated successfully, but these errors were encountered: