Discussion:
[brlcad-devel] GPU Boolean Evaluation for CSG Ray-Tracing
Marco Domingues
2017-06-19 17:36:09 UTC
Permalink
Hello,

So after reviewing the code with the ‘weave_segs’ kernel last Thursday with Vasco Costa via Skype, I made some improvements on the code and submitted an initial patch today with the ‘weave_segs’ kernel and the code to weave segments into partitions. (The patch can be found here: https://sourceforge.net/p/brlcad/patches/468/)

It uses a 32-bit vector to represent the segments in each partition, with a corresponding assertion in the host code to ensure that there are never more than 32 segments per ray.
I would appreciate some feedback on the patch.
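For illustration, a minimal sketch of what such a 32-bit segment bitvector and its limit check could look like (the names here are hypothetical, not taken from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: each partition stores a 32-bit mask where bit i set
 * means ray segment i contributes to that partition. */
#define MAX_SEGS_PER_RAY 32

static uint32_t mark_segment(uint32_t segs, unsigned seg_idx)
{
    assert(seg_idx < MAX_SEGS_PER_RAY);  /* mirrors the host-side assertion */
    return segs | (1u << seg_idx);
}

static int has_segment(uint32_t segs, unsigned seg_idx)
{
    return (segs >> seg_idx) & 1u;
}
```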

I tested the code with scenes containing union, difference, and intersection operations, and the resulting images matched the images from the ANSI C code.

I will attach a diff file with the code I wrote to evaluate the partitions and to shade the evaluated partitions (white/black shading). It doesn’t support CSG scenes with more than one region yet!

I will be working next on adding support for evaluating CSG scenes with more than one region and shading the objects with the correct materials.

Thanks in advance for the help!
Marco Domingues
Christopher Sean Morrison
2017-06-20 17:03:54 UTC
Permalink
Hi Marco,


So after reviewing the code with the ‘weave_segs’ kernel last Thursday with Vasco Costa via Skype, I made some improvements on the code and submitted an initial patch today with the ‘weave_segs’ kernel and the code to weave segments into partitions. (The patch can be found here: https://sourceforge.net/p/brlcad/patches/468/)


Outstanding.  Great to see some real useful progress.



It contains a bit vector of 32 bits to represent the segments in each partition and the respective assertion on the host code to ensure that there aren’t more than 32 segments per ray.
Would appreciate some feedback on the patch.


Where in the patch is the 32-bit limit exactly?  I glanced quickly, but didn't see a comment.  Please at least leave a comment about it even if the limit is type-based.  Also, curious why you disabled all of the coloring code in rt.cl.  Presumably just for testing, but did something not work right?  Also curious that your cl partition structure still contains pointers (instead of SoA) -- is that intentional?

 
I tested the code with scenes containing union, difference and intersection operations and the resulting images matched the images from the ansi c code.


Can you report what the difference in timings is looking like for just segment weaving?  I know your test cases were pretty simple, but curious where the times are at in comparison.


I will attach a diff file with the code I made to eval the partitions and to shade the evaluated partitions (white/black shading). It doesn’t support CSG scenes with more than 1 region yet!


Feel free to share a picture, or include some more in your dev log.  The picture from June 9th was cool to see.  While we do eventually want the raytrace picture to match, it's more important to first get ray partitions matching.  For comparing ANSI C timings, you should try using a different lighting model (rt -l#) that does less work, otherwise it's apples n' oranges different.  Diffuse is -l1, surface normals is -l2, curvature is -l5.  Feel free to add a new lighting model that does the simple white/black shading so you have a more exact comparison.



I will be working next on adding support to evaluate CSG scenes with more than 1 region and shading the objects with the correct materials.


Excellent.  Sounds like Vasco has you right on track.  Thanks for the update!



Cheers!

Sean
Vasco Alexandre da Silva Costa
2017-06-20 18:56:48 UTC
Permalink
Post by Marco Domingues
It contains a bit vector of 32 bits to represent the segments in each
partition and the respective assertion on the host code to ensure that
there aren’t more than 32 segments per ray.
Would appreciate some feedback on the patch.
Where in the patch is the 32-bit limit exactly? I glanced quickly, but
didn't see a comment. Please at least leave a comment about it even if the
limit is type-based. Also, curious why you disabled all of the coloring
code in rt.cl Presumably just for testing, but did something not work
right? Also curious that your cl partition structure still contains
pointers (instead of SoA) -- that intentional?
Like Marco said, there's a 32-bit (uint) bitvector on the partition which
states which ray segment(s) are contained in that partition. This assumes
there aren't more than 32 segments per ray, which I think is a reasonable
assumption considering most scenes don't have a lot of depth complexity. If
this becomes a problem for whatever reason, it's possible to use a dynamic
bitvector, by checking the max number of segments per ray before allocating
the partitions, at the expense of more cache thrashing.
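As a small illustration of the sizing step for such a dynamic bitvector (a hypothetical helper, not from the patch): the number of 32-bit words needed per partition follows directly from the measured maximum number of segments per ray.

```c
#include <stdint.h>

/* Hypothetical sketch: compute how many 32-bit words a per-partition
 * bitvector needs, given the max number of segments observed per ray. */
static unsigned words_needed(unsigned max_segs_per_ray)
{
    return (max_segs_per_ray + 31u) / 32u;  /* round up to whole uint words */
}
```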

I don't know about this case in particular but I didn't find much advantage
in using a SoA (structure-of-arrays) storage scheme to store the primitives
on a GPU either. This seems to be more of an issue in vector architectures
like SSE which had a lot of memory alignment issues and the like.
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
Marco Domingues
2017-06-21 12:12:46 UTC
Permalink
Post by Christopher Sean Morrison
Hi Marco,
Post by Marco Domingues
So after reviewing the code with the ‘weave_segs’ kernel last Thursday with Vasco Costa via Skype, I made some improvements on the code and submitted an initial patch today with the ‘weave_segs’ kernel and the code to weave segments into partitions. (The patch can be found here: https://sourceforge.net/p/brlcad/patches/468/)
Outstanding. Great to see some real useful progress.
Post by Marco Domingues
It contains a bit vector of 32 bits to represent the segments in each partition and the respective assertion on the host code to ensure that there aren’t more than 32 segments per ray.
Would appreciate some feedback on the patch.
Where in the patch is the 32-bit limit exactly? I glanced quickly, but didn't see a comment. Please at least leave a comment about it even if the limit is type-based. Also, curious why you disabled all of the coloring code in rt.cl. Presumably just for testing, but did something not work right? Also curious that your cl partition structure still contains pointers (instead of SoA) -- is that intentional?
My bad! I should have added a comment in the first place. Will do so! Yes, I disabled the colouring code in rt.cl only for testing purposes. That code would overwrite the white shading from the boolean evaluation, since it shades all intersected segments and not only the segments from evaluated partitions. I will have to change that code to shade only segments from evaluated partitions; I just wanted an easier way to test the code.

Regarding the pointers in the cl_partition structure, there isn’t any good reason to use them, and I should replace them with something like this:

struct partition {
    struct seg inseg;
    struct hit inhit;
    struct seg outseg;
    struct hit outhit;
    uint segs; /* 32-bit vector to represent the segments in the partition */
};

Thanks for pointing that out
Post by Christopher Sean Morrison
Post by Marco Domingues
I tested the code with scenes containing union, difference and intersection operations and the resulting images matched the images from the ansi c code.
Can you report what the difference in timings is looking like for just segment weaving? I know your test cases were pretty simple, but curious where the times are at in comparison.
Post by Marco Domingues
I will attach a diff file with the code I made to eval the partitions and to shade the evaluated partitions (white/black shading). It doesn’t support CSG scenes with more than 1 region yet!
Feel free to share a picture, or include some more in your dev log. The picture from June 9th was cool to see. While we do eventually want the raytrace picture to match, it's more important to first get ray partitions matching. For comparing ANSI C timings, you should try using a different lighting model (rt -l#) that does less work, otherwise it's apples n' oranges different. Diffuse is -l1, surface normals is -l2, curvature is -l5. Feel free to add a new lighting model that does the simple white/black shading so you have a more exact comparison.
Post by Marco Domingues
I will be working next on adding support to evaluate CSG scenes with more than 1 region and shading the objects with the correct materials.
Excellent. Sounds like Vasco has you right on track. Thanks for the update!
I followed your suggestion and added a lighting model to the ANSI C code that does the white shading :) I also uploaded some more new images of the results and added a table with the time comparison. Running the OCL code on my CPU is faster than running it on an Nvidia GTX 970, which isn't a great board for double-precision calculations. I have included those times for reference.
Post by Christopher Sean Morrison
Cheers!
Sean
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
BRL-CAD Developer mailing list
https://lists.sourceforge.net/lists/listinfo/brlcad-devel
Thanks for the help!
Cheers,
Marco
Vasco Alexandre da Silva Costa
2017-06-21 15:28:30 UTC
Permalink
Post by Marco Domingues
I followed your suggestion and added a lighting model to the ansi c code
that does the white shading :) And uploaded some more new images of the
results and also added a table with the time comparison.
Right now you have rt_boolweave() and bool_eval() working. This won't be
complete until you implement rt_boolfinal(). So yes, there are still some
limitations, but those should be solved further along the schedule.
Post by Marco Domingues
Running the OCL code in my CPU is faster than running it on a Nvidia GTX
970, which isn't a great board to perform double-precision calculations. I
have included those times for reference.
http://brlcad.org/wiki/User:Marco-domingues/GSoC17/Log#21_June

There are lots of optimizations which can be done after we get things
working:
- There is some CPU<->GPU traffic I left over between the count_hits()
and store_hits() kernels, to perform a prefix sum which should be done on
the GPU.
- The RPN tree I implemented for bool_eval() is essentially optimal in
memory space and code size, but not in traversal steps. It should be
possible to evaluate the expression in fewer steps with a different
expression representation.
- There's bitscanning going on in bool_eval() to check which objects are
intersected in the partitions, which could probably be further optimized
with bit ops (e.g. with OpenCL clz and shifts).
- The data structure that is used to store the partitions is not optimal,
since we're basically using a dynamic array when we could be using a
dynamic list, given that it's possible to insert partitions in the middle
of the partition list. This should make a difference in scenes with a lot
of depth complexity (which is not the case in these tests).

Plus, like Marco said, the GTX 970 is gaming optimized, it just doesn't
have a lot of DP FLOPS.
GTX 970 has 190 DP FLOPS, compare that with the 3494 SP FLOPS it can
achieve. In comparison my old GTX TITAN card has 1300 DP FLOPS and ~4500 SP
FLOPS. A modern V100 accelerator can do like 7450 DP FLOPS. It also costs
an arm and a leg though. A workstation with 4x V100 accelerators costs $69k.

Regards,
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
Vasco Alexandre da Silva Costa
2017-06-21 15:50:12 UTC
Permalink
On Wed, Jun 21, 2017 at 4:28 PM, Vasco Alexandre da Silva Costa <
On Wed, Jun 21, 2017 at 1:12 PM, Marco Domingues <
Post by Marco Domingues
I followed your suggestion and added a lighting model to the ansi c code
that does the white shading :) And uploaded some more new images of the
results and also added a table with the time comparison.
Right now you got rt_boolweave() and bool_eval() working. This won't be
complete until you implement rt_boolfinal(). So yes there are still some
limitations but those should be solved further along the shedule.
Post by Marco Domingues
Running the OCL code in my CPU is faster than running it on a Nvidia GTX
970, which isn't a great board to perform double-precision calculations. I
have included those times for reference.
http://brlcad.org/wiki/User:Marco-domingues/GSoC17/Log#21_June
There are lots of optimizations which can be done after we get things
- There is some CPU<-> GPU traffic I left over between, the count_hits()
and store_hits() kernels, to perform a prefix sum which should be done on
the GPU.
- The RPN tree I implemented for bool_eval() is like memory space optimal
and code minimal but not traversal steps optimal. It should be possible to
evaluate the expression in less steps with a different expression
representation.
- There's bitscanning going on in bool_eval() to check which objects are
intersected in the partitions which could probably be further optimized
with bit ops (e.g. with OpenCL clz and shifts).
- The data structure that is used to store the partitions is not optimal,
since we're basically using a dynamic array when we could be using a
dynamic list, given that it's possible to insert partitions in the middle
of the partition list. This should make a difference in scenes with a lot
of depth complexity (which is not the case in these tests).
Plus, like Marco said, the GTX 970 is gaming optimized, it just doesn't
have a lot of DP FLOPS.
GTX 970 has 190 DP FLOPS, compare that with the 3494 SP FLOPS it can
achieve. In comparison my old GTX TITAN card has 1300 DP FLOPS and ~4500 SP
FLOPS. A modern V100 accelerator can do like 7450 DP FLOPS. It also costs
an arm and a leg though. A workstation with 4x V100 accelerators costs $69k.
PS: That's supposed to be GFLOPS. Not FLOPS. :-P
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
Marco Domingues
2017-07-17 22:23:43 UTC
Permalink
Hello,

As combined, I have made some improvements on the code (based on Vasco’s feedback) and implemented an overlap handler to resolve the issue of overlapping partitions, completing the rt_boolfinal kernel. I have submitted an initial version of the code: the patch was created against the OpenCL code branch and can be found here: https://sourceforge.net/p/brlcad/patches/472/

As I state in the ticket, the patch contains the code to weave segments into partitions (rt_boolweave), the code to evaluate partitions (rt_boolfinal) and the shading of the evaluated partitions.
There are still some bugs in the code, such as:

Incorrect shading for some partitions (some normals seem to be off)
Some more complex scenes have fewer partitions evaluated than expected. (I believe this issue comes from some partitions reporting an overlap when they shouldn’t, but I still have to investigate further what is causing that.)
Apart from that, the results look promising and OK for most scenes I have tested (this includes the scenes 'share/db/operators.g' and 'share/db/boolean-ops.g'). Some images can be found here: https://brlcad.org/wiki/User:Marco-domingues/GSoC17/Log#12_-_14_July

Right now, I am using a bitarray to represent the regions involved in each partition, and this seems to be an inefficient solution, because of the high number of regions that some complex scenes have (to evaluate one partition and to resolve any overlap that may occur, I have to iterate over all the regions and test the bits, which can be very slow for some scenes).
The ‘share/db/goliath.g’ scene has a total of 1447 regions, and despite that, the max number of regions involved in one partition is 15 (from some profiling I made over the ansi c code).

In the next few days I’m planning to work on replacing the bitarray with a solution that uses a list to represent the regions involved in each partition (regiontable), which despite using more memory should be considerably faster.

Cheers,
Marco
Vasco Alexandre da Silva Costa
2017-07-19 01:50:45 UTC
Permalink
On Mon, Jul 17, 2017 at 11:23 PM, Marco Domingues <
Post by Marco Domingues
Hello,
As combined, I have made some improvements on the code (based on Vasco’s
feedback) and implemented an overlap handler
This is one of those words where Portuguese does not translate well into
English. Instead of 'combined' it is better to say 'as discussed' or 'as
agreed upon'.
Post by Marco Domingues
- Incorrect shading for some partitions (some normals seem to be off)
This is quite likely because of the issue with primitive ids being in BVH
order on the OpenCL side that we talked about on IRC.
Post by Marco Domingues
- Some more complex scenes have less partitions evaluated than
expected. (I believe this issue comes from some partitions reporting
overlap when they shouldn’t, but still have to further investigate what is
causing that)
Apart from that, the results look promising and ok for most scenes I have
tested (this includes the scenes 'share/db/operators.g' and
https://brlcad.org/wiki/User:Marco-domingues/GSoC17/Log#12_-_14_July
Right now, I am using a bitarray to represent the regions involved in each
partition, and this seems to be an inefficient solution, because of the
high number of regions that some complex scenes have (to evaluate one
partition and to resolve any overlap that may occur, I have to iterate over
all the regions and test the bits, which can be very slow for some scenes).
The ‘share/db/goliath.g’ scene has a total of 1447 regions, and despite
that, the max number of regions involved in one partition is 15 (from some
profiling I made over the ansi c code).
If the issue is with the partition bitarrays, try making an intermediate
step where the bitarrays from rt_boolweave are converted into integer lists
prior to rt_boolfinal or the render phase.
I think this algorithm should be simpler than modifying rt_boolweave to use
lists. For one, you won't need dynamic lists. You can use popcount to count
the number of bits in each bitarray (the cost is proportional to the number
of bitarray words), sum those counts to compute the amount of memory to
allocate, and then allocate an array where you store the ids of the region
bits set on each partition.
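The two-pass conversion described above can be sketched in plain C (hypothetical names; GCC/Clang's __builtin_popcount stands in for a hardware popcnt): one pass counts set bits to size the allocation, a second pass writes out the ids.

```c
#include <stdint.h>
#include <stddef.h>

/* Pass 1: popcount each word of the bitarray to learn how many ids to
 * allocate for this partition. */
static size_t count_ids(const uint32_t *bits, size_t nwords)
{
    size_t total = 0;
    for (size_t w = 0; w < nwords; w++)
        total += (size_t)__builtin_popcount(bits[w]);
    return total;
}

/* Pass 2: write the id of every set bit into the pre-allocated array;
 * returns how many ids were written. */
static size_t extract_ids(const uint32_t *bits, size_t nwords, unsigned *out)
{
    size_t n = 0;
    for (size_t w = 0; w < nwords; w++)
        for (unsigned b = 0; b < 32; b++)
            if (bits[w] & (1u << b))
                out[n++] = (unsigned)(w * 32 + b);
    return n;
}
```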

If you want to reduce the number of bits in the partition segment
bitarrays, you can also consider that each partition's bitarray only needs
to be the size of the number of segments in that specific ray, not the max
number of segments on the largest ray segment list. This complicates the
code a bit, but should significantly reduce the amount of memory required
to store the bitarray on complex scenes.

As for the regiontable bitarray in rt_boolfinal, I don't see you initialize
it to zero in the OpenCL code, plus it seems to me like all the ray threads
are writing to the same memory space so there are bound to be errors in it,
i.e.:

build_regiontable()'s pp_idx = current_index, which is the id of a
partition. You use:

set(regiontable, pp_idx + (rt_index - 1), m), i.e.
regiontable[pp_idx + (rt_index - 1) + bindex(m)] & bmask(m), i.e.
regiontable[current_index + (total_regions/32 + 1 - 1) + (m >> 5)] & (1 << m)

This doesn't look correct. This seems more like what you want:

set(regiontable, current_index * rt_index, m)

Think about it. Two adjacent partitions will write in overlapping memory
areas of the regiontable bitarray. You don't want that.
Also I think this same error is in other bitarrays you have.
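A minimal sketch of the suggested indexing, assuming rt_index is the number of 32-bit words per partition's region bitarray (e.g. total_regions/32 + 1): each partition owns its own rt_index-word slice, so adjacent partitions write disjoint words. The function names are hypothetical.

```c
#include <stdint.h>

/* Partition p owns words [p*rt_index, (p+1)*rt_index) of regiontable,
 * so two adjacent partitions can never write overlapping memory. */
static void set_region_bit(uint32_t *regiontable, unsigned rt_index,
                           unsigned partition, unsigned region)
{
    unsigned word = partition * rt_index + (region >> 5);
    regiontable[word] |= 1u << (region & 31u);
}

static int get_region_bit(const uint32_t *regiontable, unsigned rt_index,
                          unsigned partition, unsigned region)
{
    unsigned word = partition * rt_index + (region >> 5);
    return (regiontable[word] >> (region & 31u)) & 1u;
}
```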

Also, your code to iterate the bitarray is still inefficient. You check the
leading zeros in a uint, but then you still iterate the bits one by one to
visit all the set bits. This should be faster:

uint mask;  /* initialized with some bits set */
uint i, n;

i = 0;
while (mask != 0) {
    n = clz(mask);
    i += n;
    /* bit 32-i is set */
    mask <<= (n+1);
}

or some variation of this.
Post by Marco Domingues
In the next days I’m planning to work on replacing the bitarray to a
solution that uses a list to represent the regions involved in each
partition (regiontable), which despite using more memory, should be
considerably faster.
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
Vasco Alexandre da Silva Costa
2017-07-19 01:53:18 UTC
Permalink
On Wed, Jul 19, 2017 at 2:50 AM, Vasco Alexandre da Silva Costa <
***@gmail.com> wrote:

Duh.
Post by Vasco Alexandre da Silva Costa
Also your code to iterate the bitarray is still inefficient. You check the
leading zeros in an uint but then you iterate the bits one by one over that
uint mask;  /* initialized with some bits set */
uint i, n;

i = 0;
while (mask != 0) {
    n = clz(mask);
    i += n;
    /* bit 32-i is set */
    mask <<= (n+1);
}
or some variation of this.
Post by Marco Domingues
In the next days I’m planning to work on replacing the bitarray to a
solution that uses a list to represent the regions involved in each
partition (regiontable), which despite using more memory, should be
considerably faster.
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
Vasco Alexandre da Silva Costa
2017-07-19 01:54:39 UTC
Permalink
On Wed, Jul 19, 2017 at 2:50 AM, Vasco Alexandre da Silva Costa <
***@gmail.com> wrote:

Google ate my e-mail. :-)

Post by Vasco Alexandre da Silva Costa
Also your code to iterate the bitarray is still inefficient. You check the
leading zeros in an uint but then you iterate the bits one by one over that

uint mask;  /* initialized with some bits set */
uint i, n;

i = 0;
while (mask != 0) {
    n = clz(mask);
    i += n;
    /* bit 32-i is set */
    mask <<= (n+1);
    i++;
}

or some variation of this.
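For reference, the corrected loop translates to plain C roughly as follows. This is a sketch: __builtin_clz (GCC/Clang) stands in for OpenCL's clz, the function and output names are hypothetical, and the shift is split in two so a shift count of 32 (undefined in C) can never occur when the lowest bit is set.

```c
#include <stdint.h>

/* Visit every set bit of mask from the most significant end, skipping runs
 * of zeros with clz; records LSB-relative bit indices into out[] and
 * returns how many bits were visited. */
static unsigned visit_set_bits(uint32_t mask, unsigned *out)
{
    unsigned i = 0, count = 0;
    while (mask != 0) {
        unsigned n = (unsigned)__builtin_clz(mask);
        i += n;                 /* skip the leading zeros */
        out[count++] = 31u - i; /* bit index counted from the LSB */
        mask <<= n;             /* bring the found bit to the top... */
        mask <<= 1;             /* ...then discard it (total shift n+1) */
        i++;
    }
    return count;
}
```

With only the first and last bits set, the loop runs just two iterations instead of thirty-two.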
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
Christopher Sean Morrison
2017-07-19 04:56:53 UTC
Permalink
Post by Vasco Alexandre da Silva Costa
Google ate my e-mail. :-)
:)
Using bit arrays seems like such an antiquated method. I mean it’s certainly somewhat memory-friendly (in that it’s compact, minimal L2 contention), avoids pointers, etc, but what are some alternatives? I mean, I could certainly imagine a lock-free version that has nice properties and is still essentially a bit array for book-keeping, but I don’t have broader intuition. I have to imagine they ported simply because that’s what the old code does/did.

Getting it working is certainly first priority, but I’m just curious what the thoughts are on them.

Cheers!
Sean
Vasco Alexandre da Silva Costa
2017-07-19 10:42:54 UTC
Permalink
Post by Vasco Alexandre da Silva Costa
Also your code to iterate the bitarray is still inefficient. You check the
Post by Vasco Alexandre da Silva Costa
leading zeros in an uint but then you iterate the bits one by one over that
Using bit arrays seems like such an antiquated method. I mean it’s
certainly somewhat memory-friendly (in that it’s compact, minimal L2
contention), avoids pointers, etc, but what are some alternatives? I mean,
I could certainly imagine a lock-free version that has nice properties and
is still essentially a bit array for book-keeping, but I don’t have broader
intuition. I have to imagine they ported simply because that’s what the
old code does/did.
Getting it working is certainly first priority, but I’m just curious what
the thoughts are on them.
Well the code has like two phases. In one you determine which segments are
in each partition (without duplicates) and in the other you have to iterate
over the segments in a partition. The ANSI C code uses
bu_ptbl_ins_unique() in the first phase with an O(N) per insertion cost and
iterates a double-linked list with pointer chasing in the second phase. So
a bitarray in the first phase and an ordered array in the second phase
would actually be an improvement (at least in the theoretical number of
ops). Memory usage and memory coherency might still be an issue on the GPU,
though. But we need to examine the access patterns and GPU utilization
better to determine if we need a more advanced data structure, or to better
learn the traits of the one we would need to use.

It is basically a collection of dictionaries of integers in a well known
range that you initially treat like a set but that afterwards you need to
iterate.

An alternative implementation for the dictionaries would be as an ordered
list.

Or we could just build the list unsorted with duplicates, sort the list
(e.g. radix sort), and then remove the duplicates, which is what typically
is done in GPU algorithms because it is more memory coherent. But for that
we need to determine an upper bound for the used memory before the
algorithm begins and make sure that the bounds are reasonable. We haven't
done that yet. In fact, when I glanced at it, it seemed like the worst-case
total memory used was just too large.
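The sort-then-compact pattern described above can be sketched on the host side like this (a hypothetical illustration: qsort stands in for the parallel radix sort a GPU version would use):

```c
#include <stdlib.h>
#include <stddef.h>

/* Ascending comparison for unsigned ids, avoiding subtraction overflow. */
static int cmp_uint(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

/* Build the list with duplicates elsewhere, then sort it and compact
 * adjacent duplicates in place; returns the deduplicated length. */
static size_t sort_unique(unsigned *ids, size_t n)
{
    if (n == 0)
        return 0;
    qsort(ids, n, sizeof(unsigned), cmp_uint);
    size_t out = 1;
    for (size_t i = 1; i < n; i++)
        if (ids[i] != ids[out - 1])
            ids[out++] = ids[i];
    return out;
}
```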

We would also need an efficient sorting algorithm, which would take
considerable time to do from scratch (i.e. you could waste an entire GSoC
on it). There is source code for a radix sort algorithm from Thrust in
PyOCL we could use, but that is Apache License 2.0, which you guys had
concerns about before, even though I think it should be 100% compatible
with LGPL v2.1, since it's like BSD with a patent clause and LGPL v2.1 also
has a patent clause, but IANAL.

There is nothing wrong with KISS on a GPU. It is typically an order of
magnitude (that's an underestimate) more complex to implement advanced data
structures on a GPU and keep it 100% used at all times.

Regards,
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
Marco Domingues
2017-07-19 18:05:58 UTC
Permalink
Post by Marco Domingues
Hello,
As combined, I have made some improvements on the code (based on Vasco’s feedback) and implemented an overlap handler
This is one of those words where Portuguese does not translate well into English.
Instead of 'combined' it is better to say as 'discussed' or 'agreed' upon.
Incorrect shading for some partitions (some normals seem to be off)
This is quite likely because of the issue with primitive ids being in BVH order on the OpenCL side that we talked about on IRC.
Some more complex scenes have less partitions evaluated than expected. (I believe this issue comes from some partitions reporting overlap when they shouldn’t, but still have to further investigate what is causing that)
Apart from that, the results look promising and ok for most scenes I have tested (this includes the scenes 'share/db/operators.g' and 'share/db/boolean-ops.g'). Some images can be found here: https://brlcad.org/wiki/User:Marco-domingues/GSoC17/Log#12_-_14_July <https://brlcad.org/wiki/User:Marco-domingues/GSoC17/Log#12_-_14_July>
Right now, I am using a bitarray to represent the regions involved in each partition, and this seems to be an inefficient solution, because of the high number of regions that some complex scenes have (to evaluate one partition and to resolve any overlap that may occur, I have to iterate over all the regions and test the bits, which can be very slow for some scenes).
The ‘share/db/goliath.g’ scene has a total of 1447 regions, and despite that, the max number of regions involved in one partition is 15 (from some profiling I made over the ansi c code).
If the issue is with the partition bitarrays try making an intermediate step where the bitarrays from rt_boolweave are converted into integer lists prior to rt_boolfinal or the render phase.
I think this algorithm should be simpler than modifying rt_boolweave to use lists. For one you won't need dynamic lists. You can use popcnt to count the number of bits in each bitarray (complexity is amount of bitarray words) then sum that to compute the amount of memory to allocate and then allocate an array where you store the ids of the region bits set on each partition.
If you want to reduce the amount of bits on the partition segment bitarrays, you can also consider that you only need a bitarray for a partition to be the size of the amount of segments in that specific ray, not the max amount of segments on the largest ray segment list. This complicates the code a bit but should significantly reduce the amount of memory required to store the bitarray on complex scenes.
set(regiontable, pp_idx + (rt_index - 1), m) i.e.
regiontable[pp_idx + (rt_index - 1) + bindex(m)] & bmask(m)) i.e.
regiontable[current_index + (total_regions/32 + 1 - 1) + (m >> 5)] & (1 << m)
set(regiontable, current_index * rt_index, m)
Think about it. Two adjacent partitions will write in overlapping memory areas of the regiontable bitarray. You don't want that.
Also I think this same error is in other bitarrays you have.
You are right, that was wrong and it was causing scenes with more than 32 regions to have missing evaluated partitions! Thanks for pointing that out!
Post by Marco Domingues
uint mask; /* initialized with some bits set. */
uint i, n;
i = 0;
while (mask != 0) {
n = clz(mask);
i += n;
/* bit 32-i is set */
mask <<= (n+1);
}
or some variation of this.
In the next days I’m planning to work on replacing the bitarray to a solution that uses a list to represent the regions involved in each partition (regiontable), which despite using more memory, should be considerably faster.
I fixed that section of the code and now it is iterating only over set bits! Thanks for the help!

I also found the reason for the weird results when drawing geometry with the command “e *”. It is a bug in the overlap handler that also causes some partitions to be missing in scenes where overlaps occur. I’m working on a fix!

Thanks for the feedback on the code! As soon as I fix the problem with the overlap handler I will update the patch in the svn, and hopefully we can start looking into optimizations!

P.S: I noticed that the ‘bool.cl’ file is missing in the opencl branch

Cheers,
Marco
Post by Marco Domingues
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
Vasco Alexandre da Silva Costa
2017-07-19 21:26:10 UTC
Permalink
On 19 Jul 2017, at 02:50, Vasco Alexandre da Silva Costa <
Think about it. Two adjacent partitions will write in overlapping memory
areas of the regiontable bitarray. You don't want that.
Also I think this same error is in other bitarrays you have.
You are right, that was wrong and it was causing scenes with more than 32
regions to have missing evaluated partitions! Thanks for pointing that out!
That's what I'm here for. :-) Eventually you'll get used to it. Sometimes,
when I have a hard time understanding memory issues, I find it helps to
draw a little schematic on a piece of paper and then trace a couple of
iterations on it to help me visualize the problem.

I fixed that section of the code and now it is iterating only over set
bits! Thanks for the help!
This one was not a bug in your code, but it should boost performance on
very sparse bitarrays and it's a simple change to make.
Imagine you have a 32-bit uint where the only bits set are the first and
last bits for example. It should be like 3 loop iterations instead of like
32 loop iterations.
I also found the reason for the weird results when drawing geometry with
the command “e *”. It is a bug in the overlap handler that also causes some
partitions to be missing in scenes with occurrence of overlaps. I’m working
on a fix!
Great! Hopefully we'll squash all of these bugs.
Thanks for the feedback on the code! As soon as I fix the problem with the
overlap handler I will update the patch in the svn, and hopefully we can
start looking into optimizations!
Yes, once the code stabilizes we should profile it on a set of test scenes
(e.g. operators.g, goliath.g), then figure out where those bottlenecks are
located.
P.S: I noticed that the ‘bool.cl’ file is missing in the opencl branch
Duh. Nice catch! My mistake. :-P I'll get on to it.

Regards,
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
Marco Domingues
2017-07-21 17:41:48 UTC
Permalink
Hi,

I fixed the problem with the overlap handler and uploaded the patch in the svn (https://sourceforge.net/p/brlcad/patches/472/). I’ve also uploaded some more new images in my dev log with renders of scenes from the ‘share/db’ directory, which can be found here: https://brlcad.org/wiki/User:Marco-domingues/GSoC17/Log#20-21_July

There is still the issue with the materials, that I can look into next.

Complex scenes like the havoc.g and the goliath.g are still taking too much time to render, which I believe comes from the operation of building the regiontable (commenting out the function call results in huge differences in execution times), so I think we should look into an alternative way to build the regiontable. Right now, to build the regiontable, I iterate over all the segments in the partition, and then over the regions to test if the segment bits are in the boolean tree of the region. Perhaps this process could be optimised if the segments had the ‘soltab’ structure, and the respective information on the regions involved for that segment.

But I will gather some statistics for these scenes like execution times, total memory allocated, max segments per partition, max regions involved per partition and maybe we will have some other clues of possible optimisations to make on the code!

Cheers,
Marco
Vasco Alexandre da Silva Costa
2017-07-21 20:20:41 UTC
Permalink
Post by Marco Domingues
Hi,
There is still the issue with the materials, that I can look into next.
Right. I made a simple port of the Phong shader from
'liboptical/sh_plastic.c' and added basic material support to opencl librt.
However this was done near the end of the code period and it was never 100%
tested. It has issues with regions and getting the correct materials. Since
this was dependent on the CSG regions code I left finishing this for later.

Are you sure it is computing the surface normals correctly? Try to compare
the output with CSG enabled while rendering 'operators.g' in "Diffuse
Normals" mode vs the ANSI C code. You might have to change shade_segs() in '
rt.cl' to get this to work.

Post by Marco Domingues
Complex scenes like the havoc.g and the goliath.g are still taking too
much time to render, which I believe comes from the operation of building
the regiontable (commenting the
Ok but how long is "too much time"? Also how long does the ANSI C code take
to render the same scene?
Post by Marco Domingues
function call results in huge differences in execution times), so I think
we should look into an alternative to build the regiontable. Right now, to
build the regiontable I iterate over all the segments in the partition, and
then over the regions to test if the segment bits are in the boolean tree
of the region. Perhaps this process could be optimised if the segments had
the ‘soltab’ structure, and the respective information on the regions
involved for that segment.
I'll try to read the OpenCL CSG implementation source code this weekend to
compare it in terms of complexity to the ANSI C source code to see if I can
spot major differences.
Post by Marco Domingues
But I will gather some statistics for these scenes like execution times,
total memory allocated, max segments per partition, max regions involved
per partition and maybe we will have some other clues of possible
optimisations to make on the code!
Yes, we also need to measure the time it spends on each of the kernels for
a given scene at the very least. This would be a good time for you to learn
how to use a profiler:
http://gpuopen.com/compute-product/codexl/

AMD's CodeXL should be able to profile both CPU and GPU binaries and tell
you how much time is spent on each function/kernel, the number of cache
misses, etc.

There is also the NVIDIA Visual Profiler:
https://stackoverflow.com/questions/29068229/is-there-a-way-to-profile-an-opencl-or-a-pyopencl-program

But it is like they purposefully nerfed it for OpenCL code. So you are
probably better off with AMD's CodeXL.

I think you are using a CPU based OpenCL implementation in Linux right? You
should also be able to use the OProfile and KCacheGrind tools in case you
have issues with CodeXL.

Regards,
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
Marco Domingues
2017-07-22 13:59:12 UTC
Permalink
Post by Marco Domingues
Hi,
There is still the issue with the materials, that I can look into next.
Right. I made a simple port of the Phong shader from 'liboptical/sh_plastic.c' and added basic material support to opencl librt. However this was done near the end of the code period and it was never 100% tested. It has issues with regions and getting the correct materials. Since this was dependent on the CSG regions code I left finishing this for later.
Are you sure it is computing the surface normals correctly? Try to compare the output with CSG enabled while rendering 'operators.g' in "Diffuse Normals" mode vs the ANSI C code. You might have to change shade_segs() in 'rt.cl' to get this to work.
Hm I will try to replicate that lighting model and see what it looks like!
Post by Marco Domingues
Complex scenes like the havoc.g and the goliath.g, are still taking too much time to render, which I believe comes from the operation of building the regiontable (commenting the
Ok but how long is "too much time"? Also how long does the ANSI C code take to render the same scene?
Well, the ‘rt -z1 -l5’ command takes 8,103 seconds for the havoc.g scene when rendering with the GPU, and this same scene renders in 0,558 seconds when using the ANSI C code. Despite that, when I comment out the call to the ‘build_regiontable’ function, the OpenCL code only takes 0,14 seconds. So the process of building the regiontable, evaluating partitions and resolving overlaps is causing a major bottleneck for this scene.

The operators.g scene takes 0,027 seconds when using the ‘rt -z1 -l5’ command, and 0,054 when using the ‘rt’ command.
Post by Marco Domingues
function call results in huge differences in execution times), so I think we should look into an alternative to build the regiontable. Right now, to build the regiontable I iterate over all the segments in the partition, and then over the regions to test if the segment bits are in the boolean tree of the region. Perhaps this process could be optimised if the segments had the ‘soltab’ structure, and the respective information on the regions involved for that segment.
I'll try to read the OpenCL CSG implementation source code this weekend to compare it in terms of complexity to the ANSI C source code to see if I can spot major differences.
But I will gather some statistics for these scenes like execution times, total memory allocated, max segments per partition, max regions involved per partition and maybe we will have some other clues of possible optimisations to make on the code!
http://gpuopen.com/compute-product/codexl/
AMD's CodeXL should be able to profile both CPU and GPU binaries and tell you how much time is spent on each function/kernel and the number of cache misses, etc.
https://stackoverflow.com/questions/29068229/is-there-a-way-to-profile-an-opencl-or-a-pyopencl-program
But it is like they purposefully nerfed it for OpenCL code. So you are probably better off with AMD's CodeXL.
I think you are using a CPU based OpenCL implementation in Linux right? You should also be able to use the OProfile and KCacheGrind tools in case you have issues with CodeXL.
Regards,
Thanks for sharing, I will take a look into these tools and will create a document with some more detailed measurements for the scenes in the ‘share/db’ directory!

Cheers!
Vasco Alexandre da Silva Costa
2017-07-24 23:33:39 UTC
Permalink
Post by Marco Domingues
Well, the ‘rt -z1 -l5’ command takes 8,103 seconds for the havoc.g scene
when rendering with the GPU, and this same scene renders in 0,558 seconds
when using the ANSI C code. Despite that, when I comment the call to the
‘build_regiontable’ function, the OpenCL code only takes 0,14 seconds. So
the process of building the regiontable, evaluating partitions and
resolving overlaps is causing a major bottleneck for this scene.
The operators.g scene takes 0,027 seconds when using the ‘rt -z1 -l5’
command, and 0,054 when using the ‘rt’ command.
So I looked at your code and yes, the major bottleneck seems to be the
build_regiontable() function. The ANSI C code precomputes the list of
regions each solid is in before starting the boolean evaluation, around the
time it parses and optimizes the boolean tree; this is stored in
stp->st_regions. It means this is precomputed once, instead of being
computed for every partition like in your code. This should be the main
cause for the slowdown.

Regards,
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
Vasco Alexandre da Silva Costa
2017-07-25 16:16:16 UTC
Permalink
I have looked at your update for the 24th of July. You have to make sure
when the normals are computed the right primitive is being evaluated, also
in some cases the surface normal needs to be flipped. Notice that in
rt_boolweave() there is a pt_inflip and pt_outflip parameter. Then in the
ANSI C code the RT_HIT_NORMAL() macro is used to determine the actual
surface normal, i.e.:

https://svn.code.sf.net/p/brlcad/code/brlcad/trunk/include/rt/hit.h

/**
* Compute normal into (_hitp)->hit_normal.
*
* Set flip-flag accordingly depending on boolean logic, as one hit
* may be shared between multiple partitions with different flip
* status.
*
* Example: box.r = box.s - sph.s; sph.r = sph.s
*
* Return the post-boolean normal into caller-provided _normal vector.
*/
#define RT_HIT_NORMAL(_normal, _hitp, _stp, _unused, _flipflag) { \
RT_CK_HIT(_hitp); \
RT_CK_SOLTAB(_stp); \
RT_CK_FUNCTAB((_stp)->st_meth); \
{ \
void *_n = (void *)_normal; \
if ((_stp)->st_meth->ft_norm) { \
(_stp)->st_meth->ft_norm(_hitp, _stp, (_hitp)->hit_rayp); \
} \
if (_n != NULL) { \
int _f = (int)_flipflag; \
if (_f) { \
VREVERSE((fastf_t *)_normal, (_hitp)->hit_normal); \
} else { \
VMOVE((fastf_t *)_normal, (_hitp)->hit_normal); \
} \
} \
} \
}

Where:
/** @brief Reverse the direction of 3D vector `b' and store it in `a'. */
#define VREVERSE(a, b) do { \
(a)[X] = -(b)[X]; \
(a)[Y] = -(b)[Y]; \
(a)[Z] = -(b)[Z]; \
} while (0)
/** @brief Transfer 3D vector at `b' to vector at `a'. */
#define VMOVE(a, b) do { \
(a)[X] = (b)[X]; \
(a)[Y] = (b)[Y]; \
(a)[Z] = (b)[Z]; \
} while (0)

It is called like this in src/liboptical/shade.c:
/* Get surface normal for hit point */
RT_HIT_NORMAL(swp->sw_hit.hit_normal, &(swp->sw_hit),
pp->pt_inseg->seg_stp, &(ap->a_ray), pp->pt_inflip);


On Tue, Jul 25, 2017 at 12:33 AM, Vasco Alexandre da Silva Costa <
On Sat, Jul 22, 2017 at 2:59 PM, Marco Domingues <
Post by Marco Domingues
Well, the ‘rt -z1 -l5’ command takes 8,103 seconds for the havoc.g scene
when rendering with the GPU, and this same scene renders in 0,558 seconds
when using the ANSI C code. Despite that, when I comment the call to the
‘build_regiontable’ function, the OpenCL code only takes 0,14 seconds. So
the process of building the regiontable, evaluating partitions and
resolving overlaps is causing a major bottleneck for this scene.
The operators.g scene takes 0,027 seconds when using the ‘rt -z1 -l5’
command, and 0,054 when using the ‘rt’ command.
So I looked at your code and yes the major bottleneck seems to be the
build_regiontable() function. The ANSI C code precomputes the list of
regions each solid is in before starting the boolean evaluation, around the
time it parses and optimizes the boolean tree, this is stored in
stp->st_regions. It means this is precomputed once, instead of being
computed for every partition like what you're doing on your code. This
should be the main cause for the slowdown.
Regards,
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
Marco Domingues
2017-07-25 18:15:13 UTC
Permalink
Post by Marco Domingues
Well, the ‘rt -z1 -l5’ command takes 8,103 seconds for the havoc.g scene when rendering with the GPU, and this same scene renders in 0,558 seconds when using the ANSI C code. Despite that, when I comment out the call to the ‘build_regiontable’ function, the OpenCL code only takes 0,14 seconds. So the process of building the regiontable, evaluating partitions and resolving overlaps is causing a major bottleneck for this scene.
The operators.g scene takes 0,027 seconds when using the ‘rt -z1 -l5’ command, and 0,054 when using the ‘rt’ command.
So I looked at your code and yes the major bottleneck seems to be the build_regiontable() function. The ANSI C code precomputes the list of regions each solid is in before starting the boolean evaluation, around the time it parses and optimizes the boolean tree, this is stored in stp->st_regions. It means this is precomputed once, instead of being computed for every partition like what you're doing on your code. This should be the main cause for the slowdown.
Hm I think I could implement something similar. In the prep function I could build a buffer with the list of regions for each primitive, and then use the 'seg->sti' to index the buffer with the regions involved. This should be similar to what I am already doing to iterate over the boolean trees.

I’ve just finished gathering the statistics on the scenes in the ‘share/db’ directory. I will attach the pdf with the results.

I couldn’t really figure out how to use the profiling tools you mentioned. The AMD CodeXL only allowed me to use CPU Profiling Time-based Sampling with my hardware, and I couldn’t really understand the output from it. The other tools I had trouble installing/running with BRL-CAD, so I ended up gathering the data with the output from the ‘rt’ command.

Maybe it's not very accurate, but I tried to compare the code with the different kernels enabled/disabled and it pretty much confirms the bottleneck in the rt_boolfinal kernel.

Regarding your last email, yes seems like I forgot to include the inflip and outflip in the partitions, which I will fix ASAP. I will work on that and also on the regiontable bottleneck and see what I can get!

Thanks for the help!

Regards,
Marco
Vasco Alexandre da Silva Costa
2017-07-25 19:49:44 UTC
Permalink
Post by Marco Domingues
I’ve just finished gathering the statistics on the scenes in the
‘share/db’ directory. I will attach the pdf with the results.
It is amazing how c) and d) are so much slower than b). It should have only
been like 2x slower. I guess this is due to the larger working set in
memory. With the list of segments spread over a large amount of memory the
'shade_segs' phase will have poor memory coherency. It is particularly bad
in goliath.g which is the scene with most depth complexity.

It does not make sense that g) is faster than f) though.

You should use more appropriate measures, i.e. 's' or 'ms' for each cell,
depending on how much time it takes, instead of fractions.
Or MB vs KB, etc. Also use the same number format everywhere (e.g. %.2f)
and use American number format for the fraction separator, i.e. '.' vs ','.

In Table 3 "Other metrics", what does xx/yy in the "Partitions" column
mean? Is this used vs allocated partitions? The amount of wasted memory
seems particularly bad in truck.g if that is the case. It would be nice to
reduce memory consumption with partitions further. Still, at least all
these scenes would easily fit into the typical memory of a graphics card,
with under 512 MB total footprint.
Post by Marco Domingues
I couldn’t really figure out how to use the profiling tools you mentioned.
Well, the AMD CodeXL only allowed me to use CPU Profiling Time-based
Sampling with my hardware, but I couldn’t really understand the output from
it. The other tools I had trouble installing/running with BRL-CAD, so I
ended gathering the data with the output from the ‘rt’ command.
We'll have to talk about this over Skype I guess. I'm going to be a bit
busy the next couple of days though so perhaps we'll have to do it Friday
or early next week. Still the statistics you gathered are enough to start
optimizing the code.
Post by Marco Domingues
Maybe it's not very accurate, but I tried to compare the code with the
different kernels enabled/disabled and it pretty much confirms the
bottleneck in the rt_boolfinal kernel.
Everywhere you see loops within loops, large branch heavy kernels, or
memory walks with large strides, it is a good hint that code could be
further optimized.
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
Marco Domingues
2017-07-25 20:54:20 UTC
Permalink
Post by Marco Domingues
I’ve just finished gathering the statistics on the scenes in the ‘share/db’ directory. I will attach the pdf with the results.
It is amazing how c) and d) are so much slower than b). It should have only been like 2x slower. I guess this is due to the larger working set in memory. With the list of segments spread over a large amount of memory the 'shade_segs' phase will have poor memory coherency. It is particularly bad in goliath.g which is the scene with most depth complexity.
It does not make sense that g) is faster than f) though.
Yes, I also found it strange, but when I run the code on the GPU with the ‘shade_segs’ kernel disabled the frame buffer gets filled with noise, so I don’t know if this could be the cause. I can also repeat the tests for g) and f) to make sure that I did them right.
Post by Marco Domingues
You should use more appropriate measures, i.e. 's' or 'ms' for each cell, depending on how much time it takes, instead of fractions.
Or MB vs KB, etc. Also use the same number format everywhere (e.g. %.2f) and use American number format for the fraction separator, i.e. '.' vs ','.
In Table 3 "Other metrics", what does xx/yy in the "Partitions" column mean? Is this used vs allocated partitions? The amount of wasted memory seems particularly bad in truck.g if that is the case. It would be nice to reduce memory consumption with partitions further. Still, at least all these scenes would easily fit into the typical memory of a graphics card, with under 512 MB total footprint.
I will format the tables properly, thanks for the advice! And yes, in table 3 xx/yy in the “Partitions” means used/allocated partitions.
Post by Marco Domingues
I couldn’t really figure out how to use the profiling tools you mentioned. Well, the AMD CodeXL only allowed me to use CPU Profiling Time-based Sampling with my hardware, but I couldn’t really understand the output from it. The other tools I had trouble installing/running with BRL-CAD, so I ended gathering the data with the output from the ‘rt’ command.
We'll have to talk about this over Skype I guess. I'm going to be a bit busy the next couple of days though so perhaps we'll have to do it Friday or early next week. Still the statistics you gathered are enough to start optimizing the code.
Seems good! Perhaps we could do this early next week? I’m not sure if I will be able to be at the computer on Friday afternoon, if that is okay.
Meanwhile I will be working on optimizing the process of building the regiontable and fixing the issue with the normals!

Cheers,
Marco
Post by Marco Domingues
Maybe it's not very accurate, but I tried to compare the code with the different kernels enabled/disabled and it pretty much confirms the bottleneck in the rt_boolfinal kernel.
Everywhere you see loops within loops, large branch heavy kernels, or memory walks with large strides, it is a good hint that code could be further optimized.
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
Vasco Alexandre da Silva Costa
2017-07-26 05:54:54 UTC
Permalink
On 25 Jul 2017, at 20:49, Vasco Alexandre da Silva Costa <
It is amazing how c) and d) are so much slower than b). It should have
only been like 2x slower. I guess this is due to the larger working set in
memory. With the list of segments spread over a large amount of memory the
'shade_segs' phase will have poor memory coherency. It is particularly bad
in goliath.g which is the scene with most depth complexity.
I take my comment back. It seems this is just because it needs to traverse
the segments 3x vs 1x.
So it usually takes around 3x as much time.

Regarding memory consumption and complexity I have more comments to make.
[[
I've been thinking about how we can further reduce memory consumption...

sizeof() this:

struct hit {
double3 hit_point;
double3 hit_normal;
double3 hit_vpriv;
double hit_dist;
int hit_surfno;
};

vs this:

struct hit {
double hit_point[3];
double hit_normal[3];
double hit_vpriv[3];
double hit_dist;
int hit_surfno;
};

8*4*3+8+4 = 108 bytes vs
8*3*3+8+4 = 84 bytes

i.e. a compression ratio of 1.29x on struct hit BUT at the expense of lots
of source code changes on the primitives.

As for struct partition, we could use pointers to hits, or indices to
hits, instead of storing the actual hits, e.g.:
idx = (seg_idx << 1) + (in|out)bit.

This should save a lot more memory on the partitions. Like 238 bytes vs 30
bytes i.e. a compression ratio of 7.93x.
However it would increase memory thrashing quite a lot. So we should at
least cache the 'hit_dist' for the hits.
This would be 46 bytes per partition i.e. a compression ratio of 5.17x.
With struct hit pointers it would be 54 bytes per partition i.e. a
compression ratio of 4.40x (with 64-bit pointers).
]]

To put it short: my advice is that you change struct partition from storing
copies of the struct hits to having pointers to the struct hits. i.e.

struct partition {
global struct hit *inhit;
global struct hit *outhit;
uint inseg;
uint outseg;
...
};

That should save quite a lot of memory at the expense of some memory
pointer chasing. Which can be mitigated with a 'hit_dist' cache for the
inhit and outhit.

Also, let me get this straight, the ANSI C code rt_boolfinal gets
partitions from an input queue, evaluates them, and puts the valid ones in
an output queue. You also evaluate them, but then you just mark the valid
partitions as such. While this does save memory, it may make the final
shading stage take more time than it would otherwise because it needs to
skip invalid partitions.
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
Christopher Sean Morrison
2017-07-26 04:26:11 UTC
Permalink
I’ve just finished gathering the statistics on the scenes in the ‘share/db’ directory. I will attach the pdf with the results.
Thanks for sharing, very informative and interesting! If I’m reading the numbers correctly, though, it’s not making complete sense (e.g., f repeatedly slower than g). Any explanation?

One point I see is that there’s not enough work. Not just model complexity, but number of rays. I suggest using -s1024 going forward, which not only quadruples the load and is a common render size, but also converts to a millions-of-rays-per-second metric more directly.

For your "Other metrics" table at the end, be wary of making assumptions (e.g., basing an array size anywhere near the max segs or max regions per partition values) because our sample geometry is not representative of “real" models. They’re mostly tiny demos.

Production models tend to have an order of magnitude or two more regions (in the 5k-50k ballpark). If you made a script to generate that table, I’d be happy to run it on some real assets for comparison.

Cheers!
Sean
Christopher Sean Morrison
2017-07-25 04:13:01 UTC
Permalink
I fixed the problem with the overlap handler and uploaded the patch in the svn (https://sourceforge.net/p/brlcad/patches/472/). I’ve also uploaded some more new images in my dev log with renders of scenes from the ‘share/db’ directory, which can be found here: https://brlcad.org/wiki/User:Marco-domingues/GSoC17/Log#20-21_July
This is really good looking progress.

A note for debugging that might come in handy: there are features for reshooting specific rays. If you render a view via rt and right-click (maybe middle-mouse click, depending on your settings) in the framebuffer window, it should report pixel values like this per click:

At image (216, 219), real RGB=(141 141 141)
At image (213, 201), real RGB=(157 156 157)
At image (257, 224), real RGB=( 33 33 33)
At image (268, 171), real RGB=( 82 82 82)

You can then reissue the exact same rt command (e.g., rt -z1 -l5 …) with the -b "X Y" option (e.g., rt -b "257 224" …) to reshoot just that ONE single ray. That can be much easier to debug.

Note there are also a slew of flags you can set for diagnostic debug printing as well including one specific to the boolean evaluation (-x2000). This can be used by itself or with the -Q X,Y option (note the comma) to reshoot all rays, but turn on debugging for just one specific pixel.

Of course, debugging from the OpenCL side presents a challenge, but seeing the C code diagnostics might help with understanding or give you a place to break on in a debugger for the OpenCL code.
But I will gather some statistics for these scenes like execution times, total memory allocated, max segments per partition, max regions involved per partition and maybe we will have some other clues of possible optimisations to make on the code!
For what it’s worth, a holy grail will be getting the BRL-CAD benchmark to complete successfully (and faster!). You can try it by invoking “benchmark run” on the command line to get your baseline performance on the C version. Testing the OpenCL path would simply be: benchmark run -z1

It renders a set of baseline models, keeps track of timings, compares their rendered image with baseline references, and reports if they match. It’s both a performance test and a validation and verification framework. That’s obviously not going to pass now, but they are concrete examples with automated image comparisons, timings, well-known behaviors, etc.

Cheers!
Sean
Marco Domingues
2017-07-25 12:53:56 UTC
Permalink
Post by Christopher Sean Morrison
I fixed the problem with the overlap handler and uploaded the patch in the svn (https://sourceforge.net/p/brlcad/patches/472/). I’ve also uploaded some more new images in my dev log with renders of scenes from the ‘share/db’ directory, which can be found here: https://brlcad.org/wiki/User:Marco-domingues/GSoC17/Log#20-21_July
This is really good looking progress.
At image (216, 219), real RGB=(141 141 141)
At image (213, 201), real RGB=(157 156 157)
At image (257, 224), real RGB=( 33 33 33)
At image (268, 171), real RGB=( 82 82 82)
You can then reissue the exact same rt command (e.g., rt -z1 -l5 …) with the -b "X Y" option (e.g., rt -b "257 224" …) to reshoot just that ONE single ray. That can be much easier to debug.
Note there are also a slew of flags you can set for diagnostic debug printing as well including one specific to the boolean evaluation (-x2000). This can be used by itself or with the -Q X,Y option (note the comma) to reshoot all rays, but turn on debugging for just one specific pixel.
Of course, debugging from the OpenCL side presents a challenge, but seeing the C code diagnostics might help with understanding or give you a place to break on in a debugger for the OpenCL code.
This is very useful! In the past I’ve always found success in resolving bugs by tracking the behaviour for a specific ray. I would do this manually, by filtering a specific id in the kernels, or by redirecting the output to a ‘txt’ file and then analysing it. This tip will make things easier in the future! Thank you!
Post by Christopher Sean Morrison
But I will gather some statistics for these scenes like execution times, total memory allocated, max segments per partition, max regions involved per partition and maybe we will have some other clues of possible optimisations to make on the code!
For what it’s worth, a holy grail will be getting the BRL-CAD benchmark to complete successfully (and faster!). You can try it by invoking “benchmark run” on the command line to get your baseline performance on the C version. Testing the OpenCL path would simply be: benchmark run -z1
It renders a set of baseline models, keeps track of timings, compares their rendered image with baseline references, and reports if they match. It’s both a performance test and a validation and verification framework. That’s obviously not going to pass now, but they are concrete examples with automated image comparisons, timings, well-known behaviors, etc.
I did run the benchmark on the C version. As expected, the benchmark for the OpenCL path failed, so I guess some things have to be fixed before getting this to work. I will keep this in mind.

Cheers!
Marco