
[SpiROSE] Continuous integration and the FPGA story

This week, we put a conscious effort into getting the project onto a calmer track. We were under pressure from the choice of the FPGA and the SBC, as it was blocking the rest of the project. Now that it has been settled, we can really start thinking about the next tasks: developing software and tests.

Even if it is still not perfect, we spent some time configuring the continuous integration, designing the different steps for each piece of our software. As I spent most of the week doing code review (and FH work about happiness at work, which made me happy about ROSE), I highlighted some points where it failed: some commits were accepted by the tests although files were missing, incorrect style did not make the CI fail, and the file organization in some subprojects was impractical for integrating tests.

Two things are still missing. On the one hand, we have to save the artifacts so as to show demos easily, but we need more code for this to be useful, as well as easy-to-follow manuals on how to use these artifacts (flashing, running configuration, etc.). We expect to have this quickly for the LCD screen and renderer demos. On the other hand, we still don't have very precise tests on our software, but that is about to change, as we are writing simulation code in SystemC for the FPGA.

The idea, given by our teachers, is to use Verilator, a SystemVerilog-to-C++-library compiler, to put our FPGA controller into an environment with a simulated SBC input and simulated drivers, so as to produce an output image. We will be able to cross-check the protocol used by the drivers in poker mode and check timing constraints. In the future the SBC will generate sample images, and we will be able to check that they are correctly rendered by the FPGA.

To use Verilator, we have to:

  • compile each .sv file into a library (or just source/header files)
  • create a SystemC top file that integrates every piece we connect to the FPGA
  • monitor the inputs and outputs of the FPGA through SystemC stub modules (see the sketch below)
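
To make this concrete, here is a minimal sketch of what such a SystemC top could look like, assuming the controller is Verilated with "verilator --sc". The module and port names (Vmain, rgb_data, driver_sclk, …) are placeholders, not our actual interface:

    // Minimal SystemC top instantiating the Verilated FPGA controller.
    // All module/port names below are placeholders for illustration only.
    #include <systemc.h>
    #include <verilated.h>
    #include "Vmain.h"   // generated by: verilator --sc main.sv

    int sc_main(int argc, char* argv[]) {
        Verilated::commandArgs(argc, argv);

        sc_clock clk("clk", 10, SC_NS);              // 100 MHz system clock
        sc_signal<bool> nrst;                        // active-low reset
        sc_signal<uint32_t> rgb_data;                // stub for the SBC parallel RGB input
        sc_signal<bool> driver_sclk, driver_lat;     // driver control outputs to monitor
        sc_signal<uint32_t> driver_sin;              // serial data going to the drivers

        Vmain dut("dut");                            // the Verilated controller
        dut.clk(clk);
        dut.nrst(nrst);
        dut.rgb_data(rgb_data);
        dut.driver_sclk(driver_sclk);
        dut.driver_lat(driver_lat);
        dut.driver_sin(driver_sin);

        nrst = 0;
        sc_start(100, SC_NS);                        // hold reset for a few cycles
        nrst = 1;
        sc_start(1, SC_MS);                          // let the SBC/driver stubs exercise the DUT
        return 0;
    }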

Eventually, we will have to find a way to do the same on the real FPGA, so as to have integration tests on the actual hardware.

We only received the LEDs and the driver on Friday, so we didn't try another POV-effect experiment, but it will come quickly this week.

[SpiROSE] FPGA and driver inner workings

FPGA inner workings

This week we chose all the components, including the FPGA. It has enough memory to store a whole 3D image, which is nice to avoid synchronization issues. The FPGA's role is to receive the voxels sent by the SBC, cross the clock domains, and send the correct voxels and control signals to the drivers.

Driver inner workings

The TLC5957 is specifically built for high-density panels, and its inner workings are explained very well by this document. Loosely:

  • There are three buffers: the common shift register, and the first and second GS data latches. A control signal named LAT is used to latch data from one buffer to the next. There are two input clocks: SCLK for writing data and GCLK for displaying data.
  • The common shift register is 48 bits wide, and this is where we write a voxel from the outside, one bit per SCLK rising edge. We then latch the data into the first GS data latch, which is 768 bits wide (16 LEDs, 48 bits per LED). When all the data have been written into the first GS data latch, we latch it into the second one for display.
  • The trick is that GCLK must run continuously and be divided into segments of 2^N cycles, and the display latch must happen at the end of a segment. This produces overhead when SCLK and GCLK have the same period (which is our case): after 768 cycles you have written all the data, but you still have to wait for the 1024th cycle.

Sending 16 bits per color would make the bandwidth too high; fortunately, we chose the TLC5957 precisely because it has a poker mode, allowing us to send anywhere from 9 to 16 bits per color. We will send 9 bits. However, in poker mode we don't write one voxel at a time into the common shift register; instead we write the 9th bit of all 16 LEDs, then the 8th bit, and so on. This means we need all 16 voxels before writing anything to the driver, which will have to be handled by the FIFO. The reordering is illustrated below.
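
To illustrate the reordering (the exact LED and color ordering is defined by the datasheet and will be cross-checked in simulation), here is a rough software model of how a column of 16 voxels could be turned into the 9 words of 48 bits sent in poker mode:

    // Software model of the 9-bit poker-mode reordering: bit 8 of every LED first,
    // then bit 7, and so on. The LED/color ordering below is an assumption.
    #include <array>
    #include <cstdint>
    #include <vector>

    struct Voxel { uint16_t r, g, b; };  // 9 significant bits per color

    std::vector<uint64_t> pokerize(const std::array<Voxel, 16>& column) {
        std::vector<uint64_t> words;     // 9 words of 48 bits (stored in the low bits)
        for (int bit = 8; bit >= 0; --bit) {
            uint64_t word = 0;
            for (const Voxel& v : column) {
                word = (word << 1) | ((v.r >> bit) & 1u);
                word = (word << 1) | ((v.g >> bit) & 1u);
                word = (word << 1) | ((v.b >> bit) & 1u);
            }
            words.push_back(word);       // one 48-bit common-shift-register load
        }
        return words;
    }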

Therefore, in poker mode we send 432 bits and then wait for the 512th GCLK cycle. But wait, this changes last week's bandwidth calculation! The new bandwidth is 31.46 MHz, which is still under the driver's 33 MHz limit, so we're saved.
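
For reference, this figure can be reconstructed roughly as follows; the refresh and rotation figures (256 panel refreshes per turn, 30 turns per second) are assumptions of mine and may not match our final numbers:

    // SCLK/GCLK frequency estimate for 9-bit poker mode with 8-multiplexing.
    // 256 refreshes per turn and 30 turns per second are assumed values.
    constexpr int gclk_cycles_per_bank = 512;   // 2^9 segment imposed by poker mode
    constexpr int banks_per_refresh    = 8;     // 8-multiplexing
    constexpr int refreshes_per_turn   = 256;
    constexpr int turns_per_second     = 30;

    constexpr double clock_hz = 1.0 * gclk_cycles_per_bank * banks_per_refresh
                              * refreshes_per_turn * turns_per_second;
    // clock_hz == 31,457,280 Hz, i.e. ~31.46 MHz, below the driver's 33 MHz limit.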

A word about continuous integration

ModelSim and Quartus being proprietary software, we can't use them in a Docker image to test our RTL code. Thus we chose Verilator, which translates SystemVerilog to SystemC, allowing us to write our tests in the latter.

[SpiROSE] Moar render

SoM – Finally the final choice

This week, we finally chose an appropriate SoM among all those I had found. Well, actually, only one matched the requirements:

  • Onboard Wifi
  • GPU
  • Parallel RGB and/or fast GPMC-like interface
  • No “contact us” bullcrap to get one

As it turns out, those criteria are so specific that only the WandBoard matched our requirements. The WiFi requirement kicked the FireFly boards out, the GPU removed all FPGA SoCs, parallel RGB removed the RK3399-based boards, fast GPMC removed all the Gumstix ones, and the “contact us” threw away Variscite and Theobroma.

Phew. It was thus ordered on Friday. We expect it to arrive as soon as possible.

Rendering – pipeline mostly done

I also refined the rendering side of things, by generating the image that would be sent by the SBC to the FPGA.

In the above picture, the bottom-left corner still shows the voxel texture, but also the end goal, in white: the 32 slices of the voxelized Suzanne along a vertical plane.

Each of those slices (or micro-images) effectively corresponds to one refresh of the rotating panel. Thus, the FPGA will only have to cherry-pick the proper subpart of the image to refresh the LED panel.

This sliced version is generated in a pixel shader from the voxel texture (the blue thing on the bottom left).
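
As a purely hypothetical illustration of that cherry-picking (the final frame layout is not decided, and the slice width here is a guess), a software model of the lookup could look like this:

    // Hypothetical lookup: the 32 slices are assumed to be laid out side by side
    // in the frame; the FPGA would pick the slice matching the current panel angle.
    struct Pixel { int x, y; };

    Pixel source_pixel(float angle_rad, int led_x, int led_y,
                       int slice_count = 32, int slice_width = 83) {
        constexpr float two_pi = 6.2831853f;
        int slice = static_cast<int>(angle_rad / two_pi * slice_count) % slice_count;
        return { slice * slice_width + led_x, led_y };
    }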

Next week

We'll (hopefully) get to play with the SBC. Despite having an Athens week, I'll try to port this renderer to OpenGL ES, which may not be trivial.

SpiROSE: LEDs place and route

My week was dedicated to placement tests for the LED PCBs. The schematic for one ‘driver column’ (meaning the set of LED columns driven by a group of drivers, forming a repeatable block) has been done using 3 drivers per ‘driver column’. The placement is:

  • 2 drivers on the top of the circuit, with a ground plane under
  • 3 ‘micro-planes’ of LEDs (a micro-plane is 8×16 LEDs in 8-multiplexing, with column multiplexing)
  • 1 driver on the bottom of the circuit, with a ground plane under

There have been multiple trials for the place and route; the pictures retrace the main steps.

This is the last version, as of Sunday morning.

Part of a driver being routed.

LEDs main view.

Schematic of two thirds of one ‘micro-column’.

Details on the routing of the LEDs.

The components are all on the top layer, so that we can put two PCBs back to back without any problem (with an insulation layer between the two to avoid short circuits).

The place-and-route step of the LED PCB is nearly finished; it should be done before the end of the week.

[SpiROSE] Yummy voxels

Howdy!

This week, we finally fixed the LED count on our panel, though it might still get modified due to PCB routing constraints. Anyways, we ended up with an 83×46 display.

I also polished the renderer and got the voxelization working. It runs in real time on the GPU, in a single pass. I will now describe how this voxelization works, and show some results.

Voxelization

In my previous post, I mentioned a paper that shows a technique to voxelize an OpenGL scene in a single pass. To simplify, I will explain it with a desired “resolution” of 8×8×8, with the scene in a cube from (-1, -1, -1) to (1, 1, 1) (in OpenGL units). We represent the voxels using the bits of a texture: here we need an 8×8 texture with 8 bits per pixel (thus grayscale). Each pixel represents a column, where each bit represents a voxel: a set bit means there is a voxel, an unset one means there is none. The least significant bit represents the lowest voxel on the z axis, the most significant one the highest. To know whether there is a voxel at OpenGL coordinates (x, y, z), we map each coordinate to the [[0, 8]] integer range and look at the z-th bit of the (x, y) pixel.
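
In code (on the CPU side, just to make the encoding explicit), the lookup boils down to:

    // 8x8x8 example: the voxel grid is an 8x8 grayscale texture, one byte per pixel.
    // Bit 0 of a pixel is the lowest voxel of its column, bit 7 the highest.
    #include <cstdint>

    bool voxel_set(const uint8_t texture[8][8], int x, int y, int z) {
        return (texture[y][x] >> z) & 1;   // x, y, z already mapped to 0..7
    }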

Now, voxelization. For this, we need a fragment shader. For starters, a fragment shader is a little program that runs on the GPU for each drawn pixel (a fragment) after rasterization of a triangle, and that outputs the final color of said fragment. For the same pixel, there can be multiple fragments: when several triangles land on top of each other. This shader has access to several properties, including the fragment's position in camera space. By using an orthographic projection from the bottom (with the appropriate clipping planes), our xyz coordinates are unchanged and are the same in both camera space and world space.

To get the fragment color, we map the z coordinate of the fragment from [-1, 1] to [[0, 8]] (an integer). This gives us the proper bit to set. We then set that bit and all the bits below it, which gives us the final color of the fragment.
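
Expressed in C++ for clarity (in the renderer this is a couple of lines of GLSL in the fragment shader, and the exact rounding/clamping may differ):

    // Per-fragment value for the 8-layer example: set the bit at the fragment's
    // depth and every bit below it, e.g. depth index 3 gives 0b00001111.
    #include <algorithm>
    #include <cstdint>

    uint8_t fragment_mask(float z) {                       // z in [-1, 1]
        int bit = static_cast<int>((z + 1.0f) * 0.5f * 8.0f);
        bit = std::clamp(bit, 0, 7);                       // keep it inside the 8 layers
        return static_cast<uint8_t>((1u << (bit + 1)) - 1u);
    }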

Courtesy of the aforementioned paper.

Now we tell OpenGL how to combine our fragments. This is done through the XOR blending mode: when combining two fragments, OpenGL applies a bitwise XOR and keeps the result. When two fragments overlap, only the bits between them remain set. If the mesh is watertight, we get an alternation of bits after each fragment encountered. Thus we get the same result as a scanline algorithm, without costly loops.
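
On desktop OpenGL this combination can be enabled with a framebuffer logic op; a minimal sketch (the actual renderer setup may differ, notably on GL ES):

    #include <GL/gl.h>

    void enable_xor_voxelization() {
        glDisable(GL_DEPTH_TEST);     // every fragment must reach the framebuffer
        glEnable(GL_COLOR_LOGIC_OP);  // replace blending by a logic operation
        glLogicOp(GL_XOR);            // new value = framebuffer XOR fragment color
    }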

Now, to the realtime rendering. This time, the voxelization is done at a 32×32×32 resolution. To get additional bits per pixel, I simply used each pixel channel: red holds the bottom 8 layers, then green, then blue, and alpha holds the top 8.
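
The lookup then just picks the right channel and bit; a small sketch, assuming the RGBA8 packing described above:

    #include <cstdint>

    struct Rgba8 { uint8_t r, g, b, a; };

    // texel is pixel (x, y) of the 32x32 voxel texture; z is the layer in 0..31.
    bool voxel_set_32(const Rgba8& texel, int z) {
        const uint8_t channels[4] = { texel.r, texel.g, texel.b, texel.a };
        return (channels[z / 8] >> (z % 8)) & 1;  // red = layers 0-7 ... alpha = 24-31
    }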

Voxelized suzanne w/ pizza transform

Voxelized suzanne w/o pizza transform

You may notice that the first one is crying. This is due to the Suzanne mesh being lame: it is not watertight at the eyes, which produces some glitches that I managed to avoid in the second one.

Also, notice that the bottom right is the direct output of the voxelization pass. For these screenshots, a second pass was needed to visualize the result, as the raw output is hard for our eyes to parse.

Back to the pizza

You may notice that I posted a screenshot with the pizza transform (which then gets reversed in the second visualization pass). Here is a screenshot outlining its benefit.

Thanks to the colors, you may be able to see all the radii traced from the center by the voxels. They exactly depict a refresh of our rotating panel: each refresh is a “radius slice”, which maps to a pixel column in our voxel image.

Outside voxels may seem extremely stretched, but this is because the transformed geometry was rendered to a 32×32 texture, giving a resolution of 32 voxels along the radius and 32 along the perimeter. This is equivalent to having 32 refreshes per turn of our rotating panel, which is, obviously, way too low.

However, as interesting as this transform is, it does require geometry shaders, which are only core in OpenGL ES 3.2. That drastically limits our SBC choice. Yet, some SoCs do support the extension on lower versions of GL ES, since this is a very useful feature, and they may pack it without all the bells and whistles of GL ES 3.2. Note that the voxelization above does not require any modern OpenGL feature; even GL ES 1.0 hardware can do it. For reference, the authors of the paper were rocking commodity 2008 GPUs.

Data streaming

The very first requirement of this project was to be able to stream a video from a computer to SpiROSE. However vague that may be, there are quite a few steps between a video on a computer and data the FPGA can understand. Moreover, streaming a video is neither the only option nor the most interesting one. Several use cases come to mind:

  • We have a 2D video on a computer (Big Buck Bunny for a change). When streaming it to the display, we somehow need to project it: be it wrapping it around a cylinder, displaying it horizontally on a single layer, or vertically on an arbitrary plane (the easiest). This would still require some software on the SBC, whose job would be to translate this 2D stream into something suitable for the FPGA.
  • We have a 3D scene. So many things can be streamed, at so many steps of the rendering pipeline:
    • Streaming the inputs. This is essentially sending the scene/mesh/… to the SBC, which renders it in 3D and generates the images for the FPGA. The computer then does nothing, except forwarding user input to manipulate the render. A typical application would be a game, where SpiROSE is an arcade machine.
    • Streaming the cuts. On the PC, the 3D scene would be arranged: all the usual 3D transforms applied (translations, rotations, …). Then, n slices would be made along a vertical plane, each representing a refresh of the panel. This gives us a set of n 2D outlines. The resulting cut geometry would be filled and triangulated, then sent to the SBC, which would rasterize it and forward the result to the FPGA.
    • Streaming the end render. The computer would do all the heavy lifting and generate an image stream that the FPGA can understand. Compress it, stream it, run gstreamer on the SBC, and you're done!

Each of those has its advantages and drawbacks. The 2D option is limited, but trivial to use. Among the 3D-scene options, the first one is the easiest on bandwidth. However, we are limited to what is programmed into the SBC, just like an arcade machine is locked to a single game; but this may also be an advantage, since SpiROSE can run on its own while still being interactive.

The second option looks really nice. However, the cutting step is CPU-only, as the resulting geometry has to be sent to the SBC; that means it will be hard to run on the SBC, and impossible to run on a GPU. On the other hand, it is really light on bandwidth and on onboard computation. But it also rules out streaming any kind of bitmap (2D video).

The last option is really nice, since we can record a video of the output and simply stream it, as with the 2D option. However, bandwidth is a real concern, and compression might end up … messy, to say the least. The issue is the hardware decoder of an SBC, which cannot push more than 60 frames per second; this means we cannot encode each panel refresh as a video frame, so we need to multiplex them within a single video frame. However, video codecs really don't like discontinuities, and 256 seemingly independent streams in a single frame is too much for them: either the final size is larger than the raw video, or everything gets blurred out. Moreover, realtime H.264/H.265 compression is not a good idea, since those codecs rely on a lot of back-references between frames. For proper compression, we'd add around one second of delay, which is way too much for, say, a game.

So, we still have to decide which route to take (well, the 2D video is kind of mandatory).

SBC / FPGA communication

Last week, I spoke about HDMI-to-parallel-RGB bridges. These chips have an issue: the scarcity of information available about them. It is pretty hard to tell whether such a chip will output bursts of data when an HDMI frame comes in, or whether it will buffer the frame and output a slower, steadier data stream. This matters, because routing 24 traces at 168 MHz is not exactly fun. This is why we are exploring two routes:

  • SBC with integrated RGB output (aka MIPI-DPI). Since the SoC generates the signal, it is much easier to control its timing. For example, the i.MX6 SoC is very flexible in this regard (it is the only one I had time to analyse, as this kind of information is hard to find).
  • Some kind of memory interface (GPMC or similar), the same way ROSEace did. However, those interfaces are getting harder and harder to find: only the Gumstix SBCs have one, but it is too slow. The only other SoCs (that I found) still offering a similar interface are the i.MX6 series, with their EIM (External Interface Module). Problem is, this kind of interface is becoming obsolete and being replaced by PCIe. But that's out of the question.

TODO

Next week, I'll continue analysing SoCs to find one with a flexible RGB interface, one that can keep the signals from being too fast (hello, signal integrity).

I will also keep working on the renderer, where I'll interlace the resulting voxelized output to get a mosaic of panel refreshes: a single video frame being a whole SpiROSE frame, embedding 256 LED frames.

See you next week 🙂

[SpiROSE] POV test on LED, need more experiments

On Wednesday, we tried to apply the experimental protocol for testing the LEDs' POV effect on a rotating plane. We used an LED matrix taped on a piece of cardboard fixed to an axis. The assembly was not perfect but might have validated the protocol, or at least the fact that we can use a camera to get a better idea of the POV effect. Actually, the second point was a success, but we couldn't validate anything more.

Here is a picture of the assembly:

And here is a picture of the result we got (5 s exposure, 50 mm focal length, ISO 3200, f/2.5 aperture):

There are two points to observe. On the one hand, the LED matrix is not centered on the axis, with ~2 cm of eccentricity. On the other hand, there are no shades in the middle of the result, so we couldn't even test with diffusers. In fact, the eccentricity moved the problem to the edge of the visible arc.

However, we had no problem seeing the matrix from a little less than 90° (perpendicular to the emission direction of the LEDs), so working on this experiment might be more complex than expected. We will try another experiment, using the LEDs chosen for the project and flattening the panel.

This week, we will start the schematics and finish defining what kind of data we send to the SBC, as it has become difficult to find an SBC meeting the requirements for everything we expected.

SpiROSE: Architecture review

This week was about choosing the main components and reviewing the architecture of the project. A big snapshot of the project was taken on 2017-11-03, available here.

Review of the architecture

The architecture looks like this:

(Erratum: the original diagram is in the snapshot slides.)

On the fixed structure, a 3-phase brushless motor (with a reduction gear) is controlled by a speed controller, itself driven by an STM32F7 DevKit we already have, which receives a speed estimate from a Hall-effect sensor. This DevKit also hosts a human-machine interface and communicates with the Single Board Computer (which we're still looking for…) over a WiFi connection. The SBC sits on the rotary part. It communicates with the FPGA (probably through an HDMI-to-RGB chip; we're still looking for high-bandwidth outputs on SBCs to avoid that), and an SPI+UART connection gives us a low-speed link between the two. The SBC also has GPIOs for programming the FPGA.

The FPGA receives the RGB stream and splits it across the ~30 8-multiplexed LED drivers that drive the LEDs. The FPGA manages the multiplexing using another Hall-effect sensor.

What’s next

The FPGA and the SBC are not fully chosen yet. We are still looking for boards with good output bandwidth that fit our requirements (so not PCIe, because of impedance-matching issues, and not GPMC, as no such board is available anymore). The other parts are chosen.

Next week we have to choose the SBC and FPGA.

[SpiROSE] LED driver choice

Last time we discussed how multiplexing reduces power consumption and the number of drivers, and hence the space constraints. This has several consequences:

  • The more we multiplex, the more bandwidth is required from the driver
  • Doing n-multiplexing divides the LEDs' intensity by n; this can be compensated with a bit of overdrive (we can go up to 8 times the intensity)
  • If a driver controls n rows (or columns), they will be staggered because they are displayed one after another

We thus did some computations to choose a suitable driver and multiplexing factor.

The two competitors were the TLC5957 and the TLC59581: the first can go up to 33 MHz and send from 9 to 16 bits per color with only 1 buffer per bank, while the second can go up to 25 MHz and sends 16 bits per color with n buffers per bank, where n is the multiplexing factor. The second one is interesting because it could handle the multiplexing without data being re-sent, thanks to its multiple buffers.

So for each multiplexing factor we computed the required bandwidth and the nominal power with and without overdrive, and obtained the following table:

Multiplexing | TLC5957 bandwidth (MHz) | TLC59581 bandwidth (MHz) | Nominal power (W) | Nominal power with ×8 overdrive (W)
2            | 6.79                    | 12.07                    | 88.7685           | 710.148
4            | 13.58                   | 24.14                    | 44.38425          | 355.074
8            | 27.16                   | 48.28                    | 22.192125         | 177.537
16           | 54.32                   | 96.56                    | 11.0960625        | 88.7685
32           | 108.64                  | 193.12                   | 5.54803125        | 44.38425

Thus the best trade-off is 8-multiplexing with the TLC5957, because we can't go beyond 4-multiplexing with the other one (at 8-multiplexing the TLC59581 would need 48.28 MHz, above its 25 MHz limit).

[SpiROSE] Pizza

Last week was … eventful. After getting turned down again and again by the mechanic, we finally arrived at a design that might work. Simply put, a stack of ROSEace won’t cut it (haha), but a big ol’ plate à la HARP might work. At least the mechanic is okay with it ¯\_(ツ)_/¯

Renderer

Anyways. I searched for algorithms suitable for the renderer we intend to write. Its job would be twofold:

  • Voxelize a 3D scene
  • Apply what I will call the Pizza Transform™ from now on

Voxelization

To my surprise, voxelization is really easy to do, even on a GPU. I found a paper performing real-time voxelization (Single-Pass GPU Solid Voxelization for Real-Time Applications by Elmar Eisemann). It is based on OpenGL's XOR blending mode. Wonderful: we'll have a renderer that will work even on complex OpenGL scenes NOT meant to be voxelized. The only constraint is for the scene to be mathematically watertight (along one axis is enough).

Pizza Transform

Now, to the Pizza Transform™. Remember that we have a circular display, where voxels are not square but round-ish rectangles? It would be better if we could “unroll” the circular image into a nice Cartesian matrix. But first, why would we want this?

Think about ROSEace. What they did was lay a square image down on their circular display; think of a tablecloth on a round table. Then, for each blade position, they took the pixel right under each LED. This is pretty ineffective, as well as inaccurate: you waste data by not using the corners, and you risk using the same pixel twice for two different voxels.

Even with a smarter yet harder approach, not refreshing all the LEDs on the blade at the same time so as to keep each voxel roughly the same length, you still end up wasting space and reusing pixels.

This is a simulation done using Python and Excel for an 88×88 image. Gray pixels are unused, green ones are used exactly once, and red ones are used more than once. On the left is the simple way (ROSEace); on the right, the more complicated one. Both waste roughly the same amount of pixels (circa 32%), and neither has a perfect 1:1 mapping of voxels to texels.

Enter the pizza transform. Its name derives from a simple way of explaining what we want to achieve. Keep in mind that we are writing a renderer, so output images do not necessarily have to be Cartesian, and we have unlimited resolution on the input.

Take a pizza (we have a 3D model of it). You may want to display it in its glorious 3D:

However, bandwidth is scarce and you don't want to waste anything, especially not any detail by missing stuff and replicating voxels! Take a knife and cut it along a radius (say, from the middle to the bottom). Now, stretch it into a rectangle with crust only on one edge:

That's our transform. Back to the renderer: we only need to do this to our whole OpenGL scene and voxelize it. The first half is kind of done, as I wrote a Pizza Proof of Concept that generates the cut and transforms an OpenGL scene in real time, using a single geometry shader.
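
The gist of the transform, stripped of the geometry-shader details (cut handling, primitive re-emission), is a simple Cartesian-to-polar remapping; a sketch:

    #include <cmath>

    struct Vec2 { float x, y; };

    // Maps a point of the scene (display centered at the origin) to the unrolled
    // image: the angle becomes one axis, the distance to the center the other.
    Vec2 pizza_transform(Vec2 p, float max_radius) {
        constexpr float two_pi = 6.28318531f;
        float radius = std::sqrt(p.x * p.x + p.y * p.y);
        float angle  = std::atan2(p.y, p.x);                // in [-pi, pi]
        return { angle / two_pi + 0.5f,                      // angle mapped to [0, 1]
                 radius / max_radius };                      // radius mapped to [0, 1]
    }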

On the rightmost half of the middle circle, you can see a vertical line. This is our vertical cut, which allows us to unwrap the mesh into what you can see on the top right. The result may look weird, but that is because the center of the circle is not at the origin, where the cut happens. Note that the render is in wireframe only to show the cut.

SBC – FPGA

Yes, we are using an FPGA after all. Who would have believed it to be the easier option? Anyways, Ethernet and the like are out of the question.

Wait … We're doing GPU rendering … Taking the video output straight to the FPGA would be perfect! Oh, any IP for HDMI, DisplayPort or the like would cost an arm, a leg and your cat 🙁 Luckily for us, the RGB protocol is trivial to use on an FPGA (think digital VGA), and even luckier, HDMI <-> RGB bridge ICs do exist, like this one from TI. One problem solved!

Now

This week, I'll try to smooth out the renderer PoC and even add voxelization to it.

We will also make a final decision on the components we will use, especially the SBC and the FPGA. Speaking of SBCs, T.G. suggested SoMs from Variscite based on NXP i.MX6 SoCs. Any experience, pros or cons with those?

We plan on using this dev board from ST (I have one, and I know Alexis has one too) to have an embedded display with a touch-screen interface to control some aspects of the display. I'll set up a base graphical project for it using ChibiOS and µGFX.

See you next week! Or even before: any feedback is welcome!

[SpiROSE] Reducing the costs

As SpiROSE has slightly shifted towards a rotating-plate design, we had to list the upcoming purchases the project will need. It quickly turned out that the costs were substantial once the first dimensions of the project were discussed.

Since we wanted – and still strongly desire! – an excellent resolution for our screen, and given the maximum physical dimensions of the rotating plate stated by the mechanic, we ended up with a staggering number of almost 9000 LEDs, which represented a third of the total costs! Other configurations were simulated with a lower LED count, which significantly decreases the costs. Having fewer LEDs implies that, if the size of the plate remains the same, we lose some resolution (but it also means fewer traces to route and fewer components to solder). As for the LED drivers, as Adrien explained, their multiplexing capability allows us to significantly shrink their number, allowing huge savings.

We then have to optimize each component with regard to its cost.

  • For some electronic components, we can benefit from the free-sample policy of many electronics distributors; the LED drivers are planned to be obtained this way.
  • The cost of the SpiROSE structure can be lowered by reducing the thickness of the metal plates and bars needed to build it. The least stressed parts of the structure can be thinner, and since we buy a certain volume of metal, this again reduces the costs without endangering the integrity of SpiROSE.
  • PCBs are a substantial part of the final cost of the project, so the smaller, the cheaper. If we manage to shrink their sizes, both for the rotary base and for the fixed base, we again shrink the costs. Another option for the LED PCB would be to buy small PCBs and assemble them together, since it is often cheaper to buy many small ones than one large one of the same total area; but finding an easy and solid way to do this is no easy task. In our cheapest simulation, we still have to solder more than 2000 LEDs, which can be done by the PCB manufacturer, but it has a price. We are discussing whether or not it would be feasible and reliable.
  • Also being expensive, the motor will be chosen carefully. It requires enough torque to start the rotation and to counterbalance air friction, as well as to withstand the speeds we are aiming for with the rotary panel.
  • Finally, we can change the model of the SBC and/or the FPGA to pick the one that fits our needs exactly. But it is risky to do this now: if we have underestimated our needs, this optimization would be pointless.

The exact dimensions of SpiROSE will be fixed early this week. In the upcoming week, we will be working on the LCD screen and the tests with the LEDs, and we will hopefully begin designing the electrical components 🙂