[SpiROSE] FPGA synchronization architecture

Synchronization signals

The engine may not run at a cycle-accurate speed, the rgb logic will not send data at exactly 33 MHz… there are various synchronization issues to tackle. Thus we need some sync signals.
In order to keep the code as simple and as little error-prone as possible, the modules will all rely on a few sync signals, each one output by the relevant module so that the other ones don’t need to worry about its inner workings. The framebuffer, for instance, shouldn’t have to know how the driver controller works, it just needs to know when to send data to it.

The synchronisation signals are the following:

Signal name | Output by | Used by | Role
hall_sensor_trigger | hall effect sensor | | Indicates that we have made half a turn
position_sync | hall effect sensor | framebuffer, driver controller | Indicates that the position has changed
rgb_enable | SPI | rgb logic | Command sent by the SBC to start everything
stream_ready | rgb logic | framebuffer, driver controller | Indicates that enough slices are stored in RAM to begin sending data to the driver
driver_ready | driver controller | framebuffer | Indicates that the drivers are ready to receive data (they are configured and we are not in a blanking cycle)
column_ready | driver controller | multiplexer | Indicates that the drivers have latched the data for a column

Modules’ states

The modules will follow those steps:

  1. At reset, the driver controller sends the default configuration to the drivers. driver_ready and column_ready will be low so the framebuffer is waiting and the multiplexer doesn’t turn on anything. The rgb_logic waits for rgb_enable to start writing to the RAM.
  2. The SPI module receives an rgb_enable command, so it drives the signal high. rgb_logic starts to read the incoming rgb stream, and monitors vsync to catch the beginning of an image. Then it starts writing to the RAM. When enough slices have been written it drives stream_ready high.
  3. The framebuffer and driver controller modules will do nothing before the stream_ready signal. When they receive the stream_ready signal they start a two-state behaviour: send data, wait for next slice. In the send data state the framebuffer listens to the driver_ready signal to send data to the driver controller, and the driver controller sends data to the drivers with the protocol described in previous posts. When a column has been sent the driver controller drives column_ready high for one cycle.
  4. When the multiplexer module receives column_ready it turns on one column only, for less than 10µs (so we don’t burn the LEDs when overdriving), then turns it off and waits for the next column_ready assertion.
  5. When the whole slice has been displayed (i.e. 8 columns), the framebuffer and driver controller enter the wait for next slice state, where they wait for the position_sync signal.
  6. The position_sync signal triggers the change from the wait for next slice state to the send data state (go back to step 4).

This is summed up in the following figure:


There is an issue if position_sync changes while we are in the send data state. This can occur when:

  • There are too many slices, so the time required to send the data exceeds the duration of a slice
  • The motor is too fast, so the hall_sensor_trigger happens too soon

The first issue can be fixed by using fewer slices, but this means a lower resolution.
The second one is unlikely to happen, as the engine speed is steady.

In case one or the other happens, we handle it this way:

  • Continue to send data for the current column
  • Then reset the relevant counters/signals and start sending the next slice normally

Thus we just finish sending our column before moving to slice n+1, so in the worst case we lose 7 columns out of 8 in slice n. Slice n+1 is displayed normally but slightly delayed, in the worst case by 512 cycles. So if the wait for next slice state lasts for more than 512 cycles, the delay is gone when we reach slice n+2; otherwise we end up in the same situation as before.
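
To make the two-state behaviour and the late position_sync handling more concrete, here is a small cycle-level model in Python. It is only a sketch and not our RTL: the class and method names are made up for the illustration, it assumes a column always takes 512 cycles, and it assumes that we enter the send data state as soon as stream_ready is seen.

```python
from enum import Enum, auto

COLUMNS_PER_SLICE = 8      # 8-multiplexing: one slice is 8 columns
CYCLES_PER_COLUMN = 512    # assumed fixed duration of one column


class State(Enum):
    WAIT_STREAM = auto()       # nothing happens before stream_ready
    SEND_DATA = auto()
    WAIT_NEXT_SLICE = auto()


class SliceSequencer:
    """Hypothetical model of the state shared by framebuffer and driver controller."""

    def __init__(self):
        self.state = State.WAIT_STREAM
        self.column = 0
        self.cycle = 0
        self.pending_sync = False   # position_sync seen while still sending
        self.slices_started = 0

    def _start_slice(self):
        # Reset the counters and start sending a new slice.
        self.state = State.SEND_DATA
        self.column = 0
        self.cycle = 0
        self.pending_sync = False
        self.slices_started += 1

    def tick(self, stream_ready, position_sync):
        """Advance one clock cycle; return True when column_ready pulses."""
        if self.state is State.WAIT_STREAM:
            if stream_ready:
                self._start_slice()
            return False
        if self.state is State.WAIT_NEXT_SLICE:
            if position_sync:
                self._start_slice()
            return False
        # SEND_DATA: remember a position_sync that arrives too early.
        if position_sync:
            self.pending_sync = True
        self.cycle += 1
        if self.cycle < CYCLES_PER_COLUMN:
            return False
        # The current column is complete.
        self.cycle = 0
        self.column += 1
        if self.pending_sync:
            self._start_slice()            # drop the rest of slice n
        elif self.column == COLUMNS_PER_SLICE:
            self.state = State.WAIT_NEXT_SLICE
        return True
```

If position_sync arrives in the middle of the first column, the model finishes that column, pulses column_ready once, and restarts at column 0 of the next slice, which is the "7 columns lost out of 8" worst case described above.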

[SpiROSE] Led panel assembly and slice format

Led panel assembly

One major issue when designing our two PCBs was the board-to-board connection and the position of the LED panels.

In order to have an 80 pixel horizontal resolution with only 40 columns per panel, we need to have the rotation axis between two columns, as the following image shows:

This means two things :

  • The panel axis is not the same as the rotation axis, they are shifted by 1.125mm (half a column), thus the whole assembly must be balanced by adding weight on the rotative base
  • As the panels are meant to be perfectly aligned with one another, the LEDs and holes have to be symmetrical about the panel axis (not the rotation axis) in order to be face to face with their clones on the other side

For one panel, the connectors are symmetrical about the rotation axis, but when we rotate it by 180°, the connectors don’t align with their clones. This is shown below:

The connectors on the rotative base have to be shifted by 2.25mm between the two lines.

The holes on the rotative base also have to be shifted by 1.125mm from the rotation axis to be aligned with the ones on the panel.

Consequences on the framebuffer and slices format

At a given position p, a column on a panel and its clone on the other one display the same voxels. Thus only 40 columns out of 80 are displayed. We have to wait half a turn to reach the position p again, but now the panels have rotated by 180°, so the new 40 columns are the missing ones. Therefore we still have our 80 voxel wide resolution.

Hence a slice is displayed in two steps :

  • the even columns first
  • the odd columns half a turn later

This could be handled by the framebuffer module, but it would have to remember which position we are at. In order to remain position agnostic, it is better to let the SBC handle this directly by outputting the slices in an adequate format.

The SBC will cut a slice into two batches, one with the even columns, the other with the odd ones. However, because the panels rotate by 180°, the odd columns need to be reversed to be displayed correctly. All of this is shown in the following figures:

Let’s assume we want to send this slice

At position p we will display the even columns

At position p plus half a turn we will display the odd columns

Thus the SBC will send the slices with the following format
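
The even/odd split can also be sketched in a few lines of Python. This is only an illustration of the idea: the function name is hypothetical, a slice is assumed to be a list of rows with column 0 on the left, and the exact order in which the SBC streams the two batches is not specified here.

```python
def split_slice(slice_rows):
    """Split a slice (list of rows, each a list of voxels) into two batches.

    even_batch keeps columns 0, 2, 4, ... in their original order.
    odd_batch keeps columns 1, 3, 5, ... in reversed order, because the
    panels have turned by 180 degrees when this batch is displayed.
    """
    even_batch = [row[0::2] for row in slice_rows]
    odd_batch = [row[1::2][::-1] for row in slice_rows]
    return even_batch, odd_batch


# Example with one 8 voxel wide row (a real slice is 80 voxels wide).
even, odd = split_slice([[0, 1, 2, 3, 4, 5, 6, 7]])
```

Here `even` is `[[0, 2, 4, 6]]` and `odd` is `[[7, 5, 3, 1]]`.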

[SpiROSE] Driver protocol and renderer test

Driver protocol

This week I made a lot of changes to the driver controller code. This module controls all 30 drivers, so it has to send the data, sclk (shift register clock), gclk (display clock) and lat commands. To test it we have a PCB with 8 columns of LEDs (7 columns with only one LED, and a last column with 16 LEDs), which simulates what a driver will have to drive: 8 columns of 16 LEDs, with multiplexing. We connected a driver to this PCB, and used the DE1-SoC board we have at school as our FPGA. On those boards the FPGA is a Cyclone V, not the Cyclone III we plan to use; however the driver controller code is device agnostic so it is not a problem.

The driver controller module is supposed to receive data from another module, called framebuffer, which reads the images from the FPGA RAM. Thus for the test I wrote a simple framebuffer emulator which sends data directly, without reading any RAM.

The driver controller is a state machine: it can send a new configuration to the drivers, dump this configuration for debug purposes, do a LED Open Detection, or be in stream mode where it sends data and commands to actually display something. This last state has to be in sync with the framebuffer, thus the framebuffer sends a signal to the driver controller when it starts sending data from a new slice. When the driver controller goes from any state to the stream state, it has to wait for this signal. This signal will also be used by the multiplexing module.

The drivers have a lot of timing requirements: after each lat command sclk or gclk needs to be paused to give the driver time to latch its buffers, and data and lat need to change several ns before or after the sclk rising and falling edges. Therefore I added a second clock, two times faster (66 MHz), and used it to generate two clocks at 33 MHz, in phase quadrature. One is used to clock the state machine, the other is used to generate sclk and gclk. This means that the lat and data edges will occur a quarter cycle before or after the sclk/gclk edges, which is enough to respect the timing requirements.
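
As an illustration of the quadrature idea, here is a tiny Python simulation. It is a sketch under an assumption: it derives the two half-rate clocks by toggling one on the rising edges of the fast clock and the other on its falling edges. This is one common way to do it and not necessarily how the clocks are generated on the board; the function name is made up.

```python
def quadrature_clocks(half_periods):
    """Derive two half-rate clocks in quadrature from one fast clock.

    Time advances in half periods of the fast clock. clk_a toggles on each
    rising edge of the fast clock and clk_b on each falling edge, so clk_b
    is clk_a delayed by a quarter of its own period.
    """
    clk_a, clk_b = 0, 0
    samples = []
    for t in range(half_periods):
        fast = 1 - (t % 2)          # fast clock level: 1, 0, 1, 0, ...
        if fast == 1:               # rising edge of the fast clock
            clk_a ^= 1
        else:                       # falling edge of the fast clock
            clk_b ^= 1
        samples.append((clk_a, clk_b))
    return samples


samples = quadrature_clocks(8)
```

One period of the slow clocks gives the samples (1, 0), (1, 1), (0, 1), (0, 0): the two clocks never change at the same instant, so a signal updated on the edges of one is stable around the edges of the other.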

In stream mode, gclk has to be input as segments of length 2^n. During a segment we have to send all the new data for the 16 LEDs, and input the right lat commands to latch the buffers. The final command has to be input precisely at the last gclk cycle of the segment. In poker mode we send only 9 bits per color, which takes 9*48 = 432 sclk cycles. The closest power of two is 512, thus we have 512-432 = 80 cycles of blanking to put somewhere. It was first decided to do all the blanking at the beginning of a segment, and then stream all the data. However as stated before, we need to pause sclk after each WRTGS command, and those are sent every 48 cycles. Fortunately one cycle is enough, thus we have 8 blanking cycles not occurring at the beginning of a segment. So now we have 72 cycles of blanking, then one cycle of blanking every 48 cycles.
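
The cycle budget of a segment can be checked with a short Python sketch. The function name is hypothetical, and the sketch assumes that the single pause cycle comes after each of the first 8 blocks of 48 bits, the last block ending exactly on the final cycle of the segment.

```python
BITS_PER_COLOR = 9     # poker mode
CHANNELS = 48          # channels per driver (16 LEDs x 3 colors)
SEGMENT = 512          # closest power of two above 9 * 48 = 432


def segment_schedule():
    """Return a list of SEGMENT entries, 'blank' or 'data', for one segment."""
    data_cycles = BITS_PER_COLOR * CHANNELS        # 432
    pauses = BITS_PER_COLOR - 1                    # 8 one-cycle pauses (assumed placement)
    leading = SEGMENT - data_cycles - pauses       # 72 blanking cycles up front
    schedule = ["blank"] * leading
    for block in range(BITS_PER_COLOR):
        schedule += ["data"] * CHANNELS
        if block < BITS_PER_COLOR - 1:
            schedule.append("blank")               # pause sclk after the block
    return schedule


schedule = segment_schedule()
```

The schedule is 512 entries long, with 72 + 8 = 80 blanking cycles and 432 data cycles, which matches the numbers above.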

This means that the frame buffer has to take those blanking cycles into account. Skipping just one of them shifts all the data, resulting in a weird color mix.


To test all this I wrote two simple demos. One simply lights all the LEDs in the same color (red, blue or green). The second one lights the LEDs with the following pattern:

A button allows shifting the pattern, resulting in a nice Christmas animation.

You can notice that the colors don’t have the same luminosity. Fortunately this can be controlled with the driver configuration: each color has a 512-step brightness control. What is still unclear to me is whether the driver simply diminishes the power sent to a color, or divides the same amount between the three colors. The current measurements we have made seem to suggest the latter, as the global amount of power doesn’t change when reducing the green intensity, for instance.

Renderer test

The renderer allows us to voxelize an OpenGL scene. It is still a proof of concept and will soon be turned into a library. To test it, I wrote an sh script that does the following steps:

  • Start the renderer with default configuration to see if there is no error and that the shaders load properly
  • Take a screenshot of the rendering with imagemagick, and check that something is actually displayed by checking the color of the central pixel
  • Start the renderer with a simple sphere, take a screenshot and compare it (with imagemagick) to a reference image to detect any changes
  • Start the renderer in xor and non-xor mode, and compare two screenshots taken at the same time

FPGA architecture

This week we have settled the FPGA architecture in order to choose the right FPGA. Fortunately it seems that this Cyclone III will do the job. The main issues to tackle were the I/O count, the clock domains, and the internal memory. So before going into details let’s define some terms:

  • We call image a whole cylinder (i.e. a whole turn)
  • We call slice a 2D image displayed by the led panel at a given position

We plan to use 256 slices per turn. Those slices are embedded into a single image sent by the SBC, but this part is quite flexible: the layout of the slices, the frequency and the blanking can be tuned. This is actually very important, because it allows us to rotate the display just by changing the order of the slices in the image sent by the SBC. This means that we don’t need to store a whole image in the FPGA, which would have been impossible with the Cyclone III. Instead we just need to store a few slices to cross the two clock domains (RGB clock and drivers clock). So let’s describe our architecture.

General Architecture

Modules’ role and architecture

Parallel RGB

The image sent by the SoM contains all the 128 slices:

The RGB logic module will write the pixels sent by the SBC into the RAM.
Since we will only use 16 bits per LED we drop the least significant bits and just store 16 bits per pixel.

In the RAM we store a whole slice in each µBlock, thus the RGB logic module has to decode the right address for each pixel.
For instance the first pixels, 1 to 80, must be written at addresses 0 to 79 (if we start at 0), but the following pixels, 81 to 160,
must be written at addresses 80*48 to 80*48+79 as they are part of the second slice.
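
A possible address decoding is sketched below in Python. It is an assumption-laden illustration: it supposes the slices are laid out side by side in the incoming image, `slices_per_row` is a hypothetical parameter (the layout is tunable), and coordinates start at 0.

```python
SLICE_WIDTH = 80
SLICE_HEIGHT = 48
SLICE_SIZE = SLICE_WIDTH * SLICE_HEIGHT   # pixels stored per slice


def ram_address(x, y, slices_per_row):
    """Map pixel (x, y) of the incoming image to its RAM address.

    Each slice occupies a contiguous block of SLICE_SIZE addresses, so
    pixels of the same image line that belong to different slices end up
    SLICE_SIZE addresses apart.
    """
    slice_index = (y // SLICE_HEIGHT) * slices_per_row + (x // SLICE_WIDTH)
    offset = (y % SLICE_HEIGHT) * SLICE_WIDTH + (x % SLICE_WIDTH)
    return slice_index * SLICE_SIZE + offset


first_of_second_slice = ram_address(80, 0, slices_per_row=16)
```

With this mapping, pixels 0 to 79 of the first line go to addresses 0 to 79, and the next pixel goes to address 80*48 = 3840, as in the example above.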

Double frame buffer

From the drivers’ perspective a slice looks like this :

In poker mode, the drivers need the bits of all the 16 LEDs, thus we don’t really send a pixel, we send a column of 16 bits
of the previous image (in red). When the drivers have sent those 16 bits we have to send the next ones, thus the frame buffer is read
30 columns at a time! This makes it impossible to write the next slice without destroying relevant data. Thus we need a second buffer.

The double frame buffer contains two buffers of size 80*48 pixels. One stores the current slice and sends it to the driver controllers while
the other reads the next slice from the RAM. When the first buffer has sent all its data to the driver controllers, the roles of the two
buffers are simply swapped. Whenever this module receives the new position from the encoder it starts to fill the right buffer with the next slice.

It takes exactly 512 cycles (driver_clk, up to 33MHz) for the drivers to send the data for their 16 LEDs, so with 8-multiplexing the frame buffer will be read in 512*8 = 4096 cycles. The second buffer will be filled in 80*48 = 3840 cycles < 4096, thus we won’t have timing issues if the frame buffer uses the same clock as the driver controllers.
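
The swapping scheme and its timing margin can be sketched as follows. This is a simplified software model with made-up names, not the RTL; it assumes one pixel is written per cycle when filling the back buffer.

```python
SLICE_PIXELS = 80 * 48      # 3840 pixels per buffer
READ_CYCLES = 512 * 8       # 4096 cycles to display one slice


class DoubleFrameBuffer:
    """Hypothetical ping-pong buffer: one side is displayed, the other is filled."""

    def __init__(self):
        self._buffers = [[0] * SLICE_PIXELS, [0] * SLICE_PIXELS]
        self._front = 0     # index of the buffer read by the driver controllers

    def write_next(self, address, pixel):
        # Fill the back buffer with the next slice, read from the RAM.
        self._buffers[1 - self._front][address] = pixel

    def read_current(self, address):
        # The driver controllers only ever see the front buffer.
        return self._buffers[self._front][address]

    def swap(self):
        # Called once the front buffer has been fully sent.
        self._front = 1 - self._front


# Cycles left over after filling the back buffer (assuming 1 pixel per cycle).
fill_margin = READ_CYCLES - SLICE_PIXELS
```

With these numbers `fill_margin` is 256 cycles, so the back buffer is always complete before the swap.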

Another solution is to have a double frame buffer of size 128 pixels for each driver, and to use an arbiter to handle the RAM access
(one simple solution is just to let the buffers read one after another, or to broadcast the pixels to all the frame buffers and let them keep the
ones they are interested in).

Driver controller and driver logic

Each driver controller follows those steps:
– send 1 bit at each clk rising during 432 cycles
– wait for 80 cycles (wait for the 512th cycle)

The driver logic will send the correct LAT commands to latch the data in the driver.

The driver logic is also used to configure the drivers (at reset or when the UART tells it to do so).
It can do so by sending the correct LAT commands and replacing the pixels of the frame buffer with
the configuration data (sent by UART or stored inside the module for the reset).

Encoder logic

The encoder sends the new position through SSI3:
– 1 bit per cycle (up to the 16th bit)
– then 1 bit error flag
– then we need to wait for at least 20µs before the next transaction

At 2MHz it means that we get a new position every 28µs.
We want to use 256 positions at 30 fps. It means that we have
1/(256*30) = 130µs between each position, thus we don’t have timing issues.
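
The same budget, written out as a small Python computation. The constant names are ours; the values are the ones quoted above, and "30 fps" is taken as 30 turns per second as in the formula.

```python
SSI_CLOCK_HZ = 2_000_000      # encoder clock
POSITION_BITS = 16
ERROR_BITS = 1
MIN_IDLE_S = 20e-6            # minimum wait between two transactions
POSITIONS_PER_TURN = 256
TURNS_PER_SECOND = 30

# Duration of one encoder transaction, including the mandatory idle time.
transaction_s = (POSITION_BITS + ERROR_BITS) / SSI_CLOCK_HZ + MIN_IDLE_S

# Time available between two consecutive positions.
position_period_s = 1 / (POSITIONS_PER_TURN * TURNS_PER_SECOND)

# How many encoder readings fit between two positions.
readings_per_position = position_period_s / transaction_s
```

This gives about 28.5µs per transaction against about 130µs per position, so more than four readings fit between two positions.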

I/O count

The Cyclone III we chose claims to have 128 I/O, and we need 80. However many I/O are specific (PLL…) or used for configuration, thus the real count is the following:

  • 32 VREFIO, used for reference voltage, since we do single ended I/O we don’t need reference voltage and we can use those I/O
  • 6 simple I/O, they don’t have any particular function
  • 8 resistor reference I/O, we can use them too as regular I/O
  • 8 PLL I/O, those are just for output clocks
  • 58 DIFFIO, most of them can be used as regular I/O except the ones needed to configure the FPGA (we will probably use STAPL, which needs the four JTAG pins)

Therefore we have more than 100 I/O available.

[SpiROSE] FPGA and driver inner working

FPGA inner working

This week we have chosen all the components, including the FPGA. It has sufficient memory to store a whole 3D image, which is nice to avoid synchronization issues. The FPGA’s role is to receive the voxels sent by the SBC, cross the clock domains, and send the correct voxels and control signals to the drivers.

Driver inner working

The TLC5957 is specially built for high density panels, and its inner workings are explained very well by this document. Loosely:

  • There are three buffers: the common shift register, and the first and second GS data latches. A control signal, named LAT, is used to latch data from one buffer to another. There are two input clocks, SCLK for writing data, and GCLK for displaying data.
  • The common shift register is 48 bits wide, and this is where we write a voxel from the outside, one bit at each SCLK rising edge. Then we latch the data into the first GS data latch, which is 768 bits wide (16 LEDs, 48 bits per LED). When all data have been written into the first GS data latch, we latch it into the second one for display.
  • The trick is that GCLK needs to be input continuously, in defined segments of 2^N cycles, and the display latch must be done at the end of a segment. This produces an overhead if SCLK and GCLK have the same period (which is our case), because after 768 cycles you have written all the data but you must wait for the 1024th cycle.

16 bits per color causes the bandwidth to be too high; fortunately we have chosen the TLC5957 because it has a poker mode, allowing us to send from 9 to 16 bits per color. We will send 9 bits. However in poker mode we don’t write a voxel in the common shift register; instead we write the 9th bit of all the 16 LEDs, then the 8th bit and so on. This means that we need the 16 voxels before writing anything in the driver, which would have to be handled by the FIFO.

Therefore in poker mode, we send 432 bits and then wait for the 512th cycle of GCLK. But wait, this changes the bandwidth calculation of last week! Now the bandwidth is 31.46 MHz, which is still under the 33 MHz of the driver: we’re saved.
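
The segment lengths and the resulting bandwidth can be reproduced with a few lines of Python. This is a sketch: the helper name is made up, and the 8-multiplexing, 256 slices per turn and 30 turns per second are assumptions taken from our other posts. They do reproduce the 31.46 MHz figure.

```python
def segment_length(bits):
    """Smallest power of two that is >= bits (GCLK segments are 2^N long)."""
    length = 1
    while length < bits:
        length *= 2
    return length


CHANNELS = 48                                  # 16 LEDs x 3 colors
full_segment = segment_length(16 * CHANNELS)   # 768 bits of data
poker_segment = segment_length(9 * CHANNELS)   # 432 bits of data

# Assumed display parameters (not stated in this post).
MULTIPLEXING = 8
SLICES_PER_TURN = 256
TURNS_PER_SECOND = 30

bandwidth_hz = poker_segment * MULTIPLEXING * SLICES_PER_TURN * TURNS_PER_SECOND
```

`full_segment` is 1024 and `poker_segment` is 512, and `bandwidth_hz` comes out at 31,457,280 Hz, i.e. about 31.46 MHz.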

A word about continuous integration

Modelsim and Quartus being proprietary software, we can’t use them in a docker image to test our RTL code. Thus we have chosen Verilator, which translates SystemVerilog to SystemC, allowing us to write our tests in the latter.

[SpiROSE] LED driver choice

Last time we discussed how multiplexing reduces power consumption and the number of drivers, and hence reduces the space constraints. This leads to several consequences:

  • The more we multiplex, the more bandwidth is required from the driver
  • Doing n-multiplexing divides the LEDs’ intensity by n, this can be fixed with a bit of overdrive (we can go up to 8 times the intensity)
  • If a driver controls n rows (or columns) they will be staggered because they are displayed one after another

Thus we made some computations to choose a suitable driver and multiplexing.

The two competitors were the TLC5957 and the TLC59581. The first one can go up to 33 MHz and send from 9 to 16 bits per color with only 1 buffer per bank; the second one can go up to 25 MHz and send 16 bits per color with n buffers per bank, where n is the multiplexing. The second one is interesting because it could do multiplexing without sending data again, thanks to its several buffers.

So for each multiplexing we computed the required bandwidth, the nominal power with and without overdrive, and obtained the following output:


Multiplexing | TLC5957 bandwidth (MHz) | TLC59581 bandwidth (MHz) | Nominal power (W) | Nominal power with x8 overdrive (W)
2 | 6.79 | 12.07 | 88.7685 | 710.148
4 | 13.58 | 24.14 | 44.38425 | 355.074
8 | 27.16 | 48.28 | 22.192125 | 177.537
16 | 54.32 | 96.56 | 11.0960625 | 88.7685
32 | 108.64 | 193.12 | 5.54803125 | 44.38425


Thus the best trade-off is 8-multiplexing with the TLC5957, because we can’t do more than 4-multiplexing with the other one.
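
The selection can be replayed with a short Python sketch that compares the bandwidths of the table with the maximum clock of each driver. The dictionary and function names are ours; the numbers are copied from the table and from the driver limits quoted above.

```python
MAX_CLOCK_MHZ = {"TLC5957": 33, "TLC59581": 25}

# Required bandwidth in MHz for each multiplexing factor (from the table).
BANDWIDTH_MHZ = {
    2: {"TLC5957": 6.79, "TLC59581": 12.07},
    4: {"TLC5957": 13.58, "TLC59581": 24.14},
    8: {"TLC5957": 27.16, "TLC59581": 48.28},
    16: {"TLC5957": 54.32, "TLC59581": 96.56},
    32: {"TLC5957": 108.64, "TLC59581": 193.12},
}


def max_multiplexing(driver):
    """Largest multiplexing factor whose bandwidth fits the driver's clock."""
    feasible = [
        factor
        for factor, row in BANDWIDTH_MHZ.items()
        if row[driver] <= MAX_CLOCK_MHZ[driver]
    ]
    return max(feasible)


best = {driver: max_multiplexing(driver) for driver in MAX_CLOCK_MHZ}
```

`best` is `{"TLC5957": 8, "TLC59581": 4}`, which is the conclusion stated above.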

[SpiROSE] Drivers tradeoff

A lot of effort has been made this week to reduce the cost of our project. Luckily this process also helps to refine our needs and design. One example is the LED drivers: with thousands of LEDs to drive, we first computed that we needed almost a thousand drivers, which was a nightmare to control and a money sink. Thus we decided that multiplexing would be the way to go.

This implies using an FPGA and has several other consequences. If we use 32-multiplexing on a 48 channel driver, we can drive 512 LEDs with one driver. However when 16 LEDs are switched on, the multiplexing causes the 496 others to be turned off, thus this divides the LEDs’ intensity by 32! Moreover it makes the refresh slower, and because we rotate at high speed (18.85 m/s tip speed at 30 fps) we may observe an offset due to the driver being too slow.

Thus we have made some computations to know if this offset will be visible:

The driver we intend to use works at 25 MHz, so if we refresh the outermost LEDs at 30 fps they will have moved 0.4 mm at the end of the refresh, which is a third of the voxel size. Thus we should not see an offset between the inner LEDs and the outer ones.

So we have a tradeoff (as in any engineering project) between the number of drivers (which are expensive) and the LEDs’ intensity.

SpiROSE: PSSC, organization and yesterday’s presentation

This week we delved into the project and tried to figure out how to organize it, defined a data path and sought suitable components. This led to a list of PSSC and a bit of panic when we realized how much we have to do. Fortunately there are five of us, and we will hopefully manage to parallelize many tasks.

We made several PSSC categories to define a clear organization: we separated software and hardware, and in each part made an incremental series of tests. Because we won’t have the boards before early December, we will do as many tests as we can on the devkits. We plan to test each main part (the fixed base, rotative base, and blades) individually, and then test their interaction.

So the organization is, loosely:

  • The following week will be focused on definitively settling the specifications, components, communication protocols and mechanical layout. Then we can draw the PCB schematics and route them. This brings us to mid-November, when we can order all the components and PCBs.
  • As soon as we receive the devkits we can start developing the code and testing every part incrementally; hopefully this is done by the end of November.
  • After we receive the main boards, we will run the tests done on the devkits on our PCBs.
  • We will start to assemble the system in early December, when we have all the components for the fixed structure, and begin soldering as soon as we have the electronic components. This has to be done before the first demo, mid-December.
  • In parallel we have to develop software to generate 3D voxel images and animations. The image part has to be ready for the first demo. We also have to develop a simple 3D game, like Tron’s motorcycles, for the third demo.
  • We plan to have a first demo mid-December, which will give us the best (or the worst, if the demo fails) Christmas ever. The first demo consists of streaming an image, not an animation, from the PC to SpiROSE.
  • The second demo, the one with beautiful 3D animations, should work in early January.
  • The third and final demo is to play a game with SpiROSE as the screen. This is planned for mid-January.

We presented this yesterday to the class and teachers, and it seems to be reasonable, although there are still many things to decide that can have a huge impact on what we planned. Here is what we learned during this presentation:

  • We wanted to use PCIe for the communication between the FPGA and the SBC; this is not possible as the IPs are under expensive licenses.
  • We may not need an FPGA if we reduce the number of voxels.
  • We considered using an LVDS video protocol, however those kinds of IP are not free.
  • We have to be careful with the power on the brushes, and thus implement tests.
  • We planned to use a CAN or FlexRay bus between the fixed part and the rotative one, but IrDA or Bluetooth will be sufficient and will remove cables, which is always nice.
  • Even without a central axis, we will still have occlusion when looking from slightly above SpiROSE.
  • We should try to simulate the POV effect with different layouts, with Blender for instance, in order to see the occlusion problems.

So by the end of next week, all this should be crystal clear and well defined!

SpiROSE, a new POV ROSE project

Here we are, at the beginning of our ROSE journey. A tremendous project is waiting for us to engineer it. As usual, we have been looking for a good pun with ROSE for the project name. Eventually we decided to call it “SpiROSE”, a tribute to HARP, RoseAce and video game dragons.

SpiROSE aims at creating a high resolution 3D POV system that can be used as easily as a regular screen, so that princess Leïa can send her hologram to Obi-Wan Kenobi without reading a full spec. To achieve this, the LEDs are distributed across several blades that will rotate at high speed. The goal is to use the full extent of the blades to create a whole cylinder, with no hole in the middle.

A 3D video pipeline needs to be created to provide awesome demos showing the potential of our system. We thought about using it as a sprite bumper viewer (see sprite lamp if you don’t know about bump mapping), then exporting models and animations from Blender and finally adding a plugin to Unity. We may only do a very simple video game using voxels directly if we don’t manage to do better with Unity. Heck, even a simple Minecraft map visualizer would do the trick!

The main subjects are :

  • The layout of the LEDs to create a full cylinder without holes
  • How to use it as a “regular” screen, how to send video data from an external source quickly enough
  • How to achieve all this while keeping a high framerate, response time and resolution
  • How to protect ourselves from Murphy’s law

For now, our main questions are focused on the mechanical parts of the project. We are trying to remove the axial part that can be found on HARP or RoseAce, and move to a rotating cylinder style instead. Besides, we are trying to find other layouts for the LEDs, but occlusion is quite a challenge in this quest. We hope that people with more experience and a better understanding of physics can explain more about it to us, as we are only computer-engineers-to-be.

We also made rough approximations in order to have an idea of the required bandwidth. This confirmed it will be yet another challenge to tackle. We will give more details during the week as soon as we are sure about the results.

We are really thrilled to work on this project as we have seen how awesome previous POV projects were, and we hope everybody will have a good experience during the next four months working on their ROSE project!